
Computer Architecture

“The architecture of a computer is the interface between the machine and the software”

- Andris Padegs, IBM 360/370 Architect

Course Outline Computer Architecture

Quarter: Autumn 2006-7

Instructor: Muhammad Jahangir Ikram

Office: Room 424

e-mail: [email protected]

Office Hours: Monday and Wednesday, 3:00 – 4:30pm

Course Outline (Contd..)

Description: This course focuses on the principles, practices and issues in Computer Architecture, while examining computer design tradeoffs both qualitatively and quantitatively.

The course starts with a quick overview of computer design fundamentals and instruction set principles, material that students have already covered in the prerequisite for this course.

The following topics are covered in greater detail:

Advanced Pipelining

Instruction-level parallelism and Compiler Support

Memory-hierarchy design

SIMD, VLIW, Superscalar Architectures

Code Optimization and Compiler Issues


Text Book: Hennessy, J. L., and Patterson, D. A., Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, 1996.


Lectures: There will be two 75-minute lectures per week and a 50-minute lecture / 100-minute lab.

TOTAL SESSIONS = 29

There will be four labs, during weeks 2, 3, 4 and 5.


Grading

Quizzes & assignments 17+3%

Laboratory 10% (Attendance 3 + Lab Task 3 + HW 4)

Midterm exam 30%

Final exam 40%

Schedule

Fundamentals of Computer Design (Sessions 1, 2; Reading 1.1 – 1.10)
  Measuring and Reporting Performance
  Quantitative Principles of Computer Design

Instruction Set Principles and Examples (Sessions 3-5; Reading 2.1 – 2.8)
  Classifying Instruction Set Architectures
  Memory Addressing
  Operations in the Instruction Set
  Encoding an Instruction Set

LAB 1: MIPS Instruction Format and Instruction Study (Session 6)

Pipelining Overview (Sessions 7-14; Reading A.1 to A.10)
  What Is Pipelining?
  Single Cycle Computer Study (Session 9)
  The Major Hurdle of Pipelining – Pipeline Hazards
  Data Hazards

LAB 2: Study of Pipelining (Session 12)

Control Hazards and Static Branch Prediction

LAB 3: Pipeline Studies and Control Hazards (Session 15)

Scoreboarding

MIDTERM

ILP and Dynamic Exploitation (Sessions 17-19; Reading 3.1 – 3.5)
  Static Branch Prediction
  Tomasulo’s Dynamic Scheduling
  Dynamic Branch Prediction
  Superscalar and VLIW architectures

Advanced Pipelining And ILP (Cont’d.) (Sessions 20-22; Reading 3.6 – 3.10)
  Taking Advantage of More ILP with Multiple Issue
  P6 Architecture

Advanced Pipelining And ILP (Cont’d.) (Sessions 23-25; Reading 4.1, 4.7)
  Compiler Support for Exploiting ILP
  Hardware Support for Extracting More Parallelism
  Putting It All Together: The PowerPC 620, and Itanium

Memory-Hierarchy Design (Sessions 26-29; Reading 5.1 – 5.7)
  The ABCs of Caches
  Reducing Cache Misses
  Reducing Cache Miss Penalty
  Virtual Memory System

Computer I/O (Session 30; Reading 6.1 - ?)

Background

Emergence of the microprocessor in the late 1970s

Roughly 35% growth in performance per year

Important changes in the marketplace:

Virtual elimination of assembly language programming reduced the need for object code compatibility

Creation of standardized, vendor-independent operating systems, such as UNIX and Linux, lowered the risk of bringing out a new architecture

Development of RISC

These changes led to the development of a new set of architectures, called RISC (Reduced Instruction Set Computer) architectures

RISC uses two performance techniques:

Instruction-level parallelism (pipelining)

Use of caches

Growth in microprocessor performance

Moore’s Law

Technology Scaling

Scaling of Transistors

Feature size has reduced from 3 micron in 1985 to 0.09 micron.

Reducing feature size means a quadratic increase in transistor count and better performance.

But it also means higher routing delays and poor performance of long wires.

It also means more power consumption (less load capacitance)
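The quadratic relationship between feature size and transistor count can be illustrated with a short sketch. It assumes ideal scaling, where the number of transistors that fit in a fixed area grows as 1/(feature size)^2, and ignores layout and design differences; the function name `density_gain` is made up for this illustration.

```python
def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """Ideal gain in transistors per unit area when the feature size shrinks.

    Assumption: density is proportional to 1 / feature_size**2.
    """
    if old_feature_um <= 0 or new_feature_um <= 0:
        raise ValueError("feature sizes must be positive")
    return (old_feature_um / new_feature_um) ** 2


# Feature sizes quoted on the slide: 3 micron down to 0.09 micron.
gain = density_gain(3.0, 0.09)
print(f"Ideal density gain: about {gain:.0f}x")
```

A linear shrink of roughly 33x therefore gives room for on the order of a thousand times more transistors in the same area, under the ideal-scaling assumption.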

The Itanium Processor

Intel microprocessor die

IC Cost Trends (Source: IC Knowledge)

Measuring performance

Definition of time:

Response time (elapsed time): the latency to complete the task, including disk access, input/output, operating system overhead etc.

CPU time:
  User CPU time: time spent in the program.
  System CPU time: time spent by the operating system on behalf of the program.

Unix time command example: 90.7s 12.9s 2:39 65%
  User CPU time = 90.7 s, system CPU time = 12.9 s, elapsed time = 2:39 (159 s)
  CPU time as a share of elapsed time = (90.7 + 12.9)/159 = 65%
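The 65% figure from the Unix time example is the sum of user and system CPU time divided by elapsed time. A minimal sketch of that arithmetic follows; the function name `cpu_share` is an illustrative choice, not part of any tool.

```python
def cpu_share(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed (wall-clock) time spent as CPU time."""
    return (user_s + system_s) / elapsed_s


# Values from the slide: 90.7 s user, 12.9 s system, 2:39 elapsed.
elapsed = 2 * 60 + 39          # 2 minutes 39 seconds = 159 s
share = cpu_share(90.7, 12.9, elapsed)
print(f"CPU time is {share:.0%} of elapsed time")
```

The remaining 35% of elapsed time was spent waiting, for example on I/O or on other programs sharing the machine.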

What is a Benchmark?

A benchmark is "a standard of measurement or evaluation" (Webster’s II Dictionary).

A computer benchmark is typically a computer program that performs a strictly defined set of operations - a workload - and returns some form of result - a metric - describing how the tested computer performed.

Computer benchmark metrics usually measure speed: how fast was the workload completed; or throughput: how many workload units per unit time were completed.

Running the same computer benchmark on multiple computers allows a comparison to be made.

Source: Standard Performance Evaluation Corporation
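To make the workload/metric vocabulary concrete, here is a toy sketch of a benchmark harness. It is not how SPEC benchmarks are run; the workload `sum_of_squares` and the helper `run_benchmark` are invented for illustration. It reports both kinds of metric mentioned above: speed (elapsed time) and throughput (work units per second).

```python
import time


def sum_of_squares(n: int) -> int:
    """Toy workload: a strictly defined set of operations."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_benchmark(workload, units: int):
    """Run the workload once; return (elapsed seconds, units per second)."""
    start = time.perf_counter()
    workload(units)
    elapsed = time.perf_counter() - start
    return elapsed, units / elapsed


elapsed, throughput = run_benchmark(sum_of_squares, 200_000)
print(f"speed: {elapsed:.4f} s, throughput: {throughput:.0f} units/s")
```

Running the same script on two machines gives two sets of numbers that can be compared directly, which is the point of a benchmark.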

Programs to Evaluate Performance

Real Applications

Modified (or scripted) applications

Kernels

Toy benchmarks

Synthetic benchmarks

Programs to evaluate performance

Real Applications
  Example: compilers for C, text-processing software etc.

Modified (or scripted) applications
  CPU-oriented benchmark; I/O may be removed to minimize its impact on execution.

Kernels
  Small, key pieces of real programs, used to isolate the performance of individual features of a machine.

Toy benchmarks
  Produce a result that the user already knows.

Synthetic benchmarks
  Try to match the average frequency of operations and operands of a large set of programs.

Benchmark Suites

Desktop (CPU and graphics intensive): SPEC95, SPEC2000 (11 Integer, 14 FP), SPEC2006 (12 Integer, 17 FP). Example programs: C compiler, router, FEM.

Server (file servers, web servers, transaction processing)

Embedded (EEMBC): 34 kernels

What is SPEC

SPEC is the Standard Performance Evaluation Corporation. SPEC is a non-profit organization whose members include computer hardware vendors, software companies, universities, research organizations, systems integrators, publishers and consultants. SPEC's goal is to establish, maintain and endorse a standardized set of relevant benchmarks for computer systems. Although no one set of tests can fully characterize overall system performance, SPEC believes that the user community benefits from objective tests which can serve as a common reference point.

What does a benchmark measure?

Compute-intensive benchmarks such as SPEC CPU2006 emphasize the performance of the computer processor (CPU), the memory architecture, and the compilers.

SPEC CPU2006 contains two components that focus on two different types of compute intensive performance:

The CINT2006 suite measures compute-intensive integer performance, and

The CFP2006 suite measures compute-intensive floating point performance

Source: Standard Performance Evaluation Corporation

Reference Machine

Source: Standard Performance Evaluation Corporation

SPEC uses a historical Sun system, the "Ultra Enterprise 2" which was introduced in 1997, as the reference machine. The reference machine uses a 296 MHz UltraSPARC II processor, as did the reference machine for CPU2000. But the reference machines for the two suites are not identical: the CPU2006 reference machine has substantially better caches, and the CPU2000 reference machine could not have held enough memory to run CPU2006.

It takes about 12 days to do a rule-conforming run of the base metrics for CINT2006 and CFP2006 on the CPU2006 reference machine. SPEC2000 now takes less than a minute on the latest high-performance machines.
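The reference machine matters because each SPEC benchmark result is reported as a ratio (time on the reference machine divided by time on the tested machine), and the overall score is the geometric mean of those ratios. The sketch below shows that calculation; the timings are invented for illustration and are not SPEC data, and the function names are hypothetical.

```python
import math


def spec_ratio(reference_s: float, measured_s: float) -> float:
    """Ratio for one benchmark: larger means faster than the reference."""
    return reference_s / measured_s


def geometric_mean(values) -> float:
    """Geometric mean, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up timings in seconds for three benchmarks (NOT real SPEC results).
reference_times = [1000.0, 2000.0, 4000.0]
measured_times = [500.0, 500.0, 500.0]

ratios = [spec_ratio(r, m) for r, m in zip(reference_times, measured_times)]
score = geometric_mean(ratios)
print("ratios:", ratios)
print(f"overall score (geometric mean): {score:.2f}")
```

With ratios of 2, 4 and 8 the geometric mean is 4, whereas the arithmetic mean would be about 4.67; the geometric mean keeps one unusually fast benchmark from dominating the score.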

Example Result for SPEC 2000

Source: Standard Performance Evaluation Corporation

SYSTEM | #CPU | BASE | PEAK

Intel SE440BX-2 (800 MHz Pentium III) | 1 core, 1 chip, 1 core/chip | 340 | 344

Intel D850GB motherboard (1.4 GHz, Pentium 4 processor) | 1 core, 1 chip, 1 core/chip | 502 | 512

Sun Blade 2500 (1.28GHz) | 1 core, 1 chip, 1 core/chip | 604 | 696

Intel D850EMV2 motherboard (2.0A GHz, Pentium 4 processor) | | |

PowerEdge 2650 (3.06 GHz Xeon) DELL | 1 core, 1 chip, 1 core/chip (Hyper-Threading Technology disabled) | 1014 | 1056

Precision WorkStation 350 (2.8 GHz P4) DELL | | |

SGI Altix 3000 (1300MHz, Itanium 2) | 1 core, 1 chip, 1 core/chip | 1019 | --

Example Result for SPEC 2000

Source: Standard Performance Evaluation Corporation

SYSTEM | #CPU | BASE | PEAK

Precision Workstation 690 (Intel® Xeon® processor 5160, 3.0 | 4 cores, 2 chips, 2 cores/chip | 3057 | 3063

PowerEdge 1950 (Intel Xeon processor 5160, 3.00GHz) | 4 cores, 2 chips, 2 cores/chip | 3061 | 3065

Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | 2 cores, 1 chip, 2 cores/chip | 3099 | 3109

Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | | 3106 | 3111

Precision Workstation 390 (Intel Core 2 Extreme processor X6 | | 3108 | 3119

Summarizing Performance

Amdahl’s Law

The performance improvement to be gained from using a faster mode of execution is limited by the fraction of the time the faster mode can be used.

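Amdahl’s Law: Law of Diminishing Returns

\[
\text{Speedup} = \frac{\text{Performance for entire task with the enhancement when possible}}{\text{Performance for entire task without the enhancement}}
\]

\[
\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]
\]

\[
\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]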
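Amdahl's Law gives the overall speedup as 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced). The sketch below evaluates it for a few cases to show the diminishing returns; the 40% fraction and the speedup factors are arbitrary example values, not taken from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is enhanced."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# Hypothetical case: 40% of the time can use the faster mode.
for factor in (2, 10, 100, 1000):
    print(f"enhancement {factor:>4}x -> overall "
          f"{amdahl_speedup(0.4, factor):.3f}x")

# Upper bound as the enhancement becomes infinitely fast: 1 / (1 - fraction).
print(f"limit: {1.0 / (1.0 - 0.4):.3f}x")
```

However fast the enhanced mode becomes, the overall speedup in this case never reaches 1/(1 - 0.4), roughly 1.67, because 60% of the time is unaffected.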

CPU performance Equations

\[
\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock Cycle}}
\]
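The CPU performance equation multiplies instruction count, cycles per instruction (CPI) and clock cycle time. As a small sketch, the function below expresses the clock cycle time as 1 / clock rate; the instruction count, CPI and 500 MHz clock are made-up values chosen only to keep the arithmetic easy to follow.

```python
def cpu_time_s(instruction_count: float, cpi: float,
               clock_rate_hz: float) -> float:
    """CPU time = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


# Hypothetical program: 1 billion instructions, CPI of 2.0, 500 MHz clock.
t = cpu_time_s(1e9, 2.0, 500e6)
print(f"CPU time: {t:.1f} s")
```

Halving any one of the three factors halves the CPU time, which is why the equation is useful for comparing design alternatives that change only one of them.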

Example:

Frequency of FP operations = 25%

Average CPI of FP operations = 4.0

Average CPI of other instructions = 1.33

Frequency of FPSQR = 2%

CPI of FPSQR = 20

Assume two design alternatives: decrease the CPI of FPSQR to 2, OR decrease the average CPI of all FP operations to 2.5.

Compare these two designs using the CPU performance equations.

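Example: Solution

\[
\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \left( \frac{\text{IC}_i}{\text{Instruction Count}} \right) = (4 \times 25\%) + (1.33 \times 75\%) \approx 2.0
\]

CPI for enhanced FPSQR

\[
\text{CPI}_{\text{new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times \left( \text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{new FPSQR only}} \right) = 2.0 - 2\% \times (20 - 2) = 1.64
\]

CPI for enhanced FP operation

\[
\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) \approx 1.625
\]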

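Example: Solution

\[
\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.0}{1.625} \approx 1.23
\]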
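The FP/FPSQR example can be checked numerically. The sketch below recomputes the three CPI values and the speedup from the figures given in the example; variable names are chosen for this illustration. Because 1.33 is itself a rounded value, the computed results differ from the slide's 2.0 and 1.625 in the third decimal place.

```python
# Figures from the example.
freq_fp, cpi_fp = 0.25, 4.0          # all FP operations
freq_other, cpi_other = 0.75, 1.33   # all other instructions
freq_fpsqr = 0.02
cpi_fpsqr_old, cpi_fpsqr_new = 20.0, 2.0
cpi_fp_new = 2.5

# Original CPI: frequency-weighted sum over instruction classes.
cpi_original = freq_fp * cpi_fp + freq_other * cpi_other

# Design 1: only FPSQR gets faster.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr_old - cpi_fpsqr_new)

# Design 2: all FP operations get faster.
cpi_new_fp = freq_other * cpi_other + freq_fp * cpi_fp_new

# Instruction count and clock cycle are unchanged, so the speedup of
# design 2 over the original is simply the ratio of the CPIs.
speedup_new_fp = cpi_original / cpi_new_fp

print(f"CPI original:       {cpi_original:.3f}")
print(f"CPI with new FPSQR: {cpi_new_fpsqr:.3f}")
print(f"CPI with new FP:    {cpi_new_fp:.3f}")
print(f"Speedup with new FP: {speedup_new_fp:.2f}")
```

Since the CPI with the faster FP operations is lower than the CPI with the faster FPSQR, improving all FP operations is the better of the two designs.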

Another Measure -- MIPS

\[
\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}}
\]

Example: An Embedded Processor

120 MIPS for the single processor.

80 MIPS for the processor/co-processor combination (that is the rating as measured for the combination).

I = number of integer instructions

F = number of floating-point instructions (8M)

Y = number of integer instructions to emulate one FP instruction (50)

W = time for choice 1 (4 seconds)

B = time for choice 2
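The slide does not spell out how the two choices are defined, so the sketch below works through one possible reading, and everything in it beyond the listed numbers is an assumption: choice 1 is the processor alone at 120 MIPS, emulating each FP instruction with Y integer instructions, and choice 2 is the processor plus co-processor at 80 MIPS, executing the F floating-point instructions directly. Under that reading, the MIPS definition (instruction count divided by execution time times 10^6) gives I and B.

```python
def execution_time_s(instruction_count: float, mips: float) -> float:
    """Rearranged MIPS definition: time = instruction count / (MIPS x 10^6)."""
    return instruction_count / (mips * 1_000_000)


F = 8_000_000          # floating-point instructions (from the slide)
Y = 50                 # integer instructions per emulated FP instruction
W = 4.0                # seconds for choice 1
MIPS_ALONE = 120       # single processor
MIPS_COMBINED = 80     # processor plus co-processor

# Assumption: choice 1 executes I + F*Y instructions in W seconds.
total_choice1 = MIPS_ALONE * 1_000_000 * W
integer_instructions = total_choice1 - F * Y        # this is I

# Assumption: choice 2 executes I + F instructions at the combined rating.
B = execution_time_s(integer_instructions + F, MIPS_COMBINED)

print(f"I = {integer_instructions / 1e6:.0f} million instructions")
print(f"B = {B:.2f} s")
```

If this reading is right, the combination with the lower MIPS rating finishes the task sooner, which illustrates why MIPS can be misleading when the instruction counts differ between the machines being compared.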

End of Lecture 1

CINT 2006

400.perlbench C PERL Programming Language

401.bzip2 C Compression

403.gcc C C Compiler

429.mcf C Combinatorial Optimization

445.gobmk C Artificial Intelligence: go

456.hmmer C Search Gene Sequence

458.sjeng C Artificial Intelligence: chess

462.libquantum C Physics: Quantum Computing

464.h264ref C Video Compression

471.omnetpp C++ Discrete Event Simulation

473.astar C++ Path-finding Algorithms

483.xalancbmk C++ XML Processing

CFP 2006

410.bwaves Fortran Fluid Dynamics

416.gamess Fortran Quantum Chemistry

433.milc C Physics: Quantum Chromodynamics

434.zeusmp Fortran Physics/CFD

435.gromacs C/Fortran Biochemistry/Molecular Dynamics

436.cactusADM C/Fortran Physics/General Relativity

437.leslie3d Fortran Fluid Dynamics

444.namd C++ Biology/Molecular Dynamics

447.dealII C++ Finite Element Analysis

450.soplex C++ Linear Programming, Optimization

453.povray C++ Image Ray-tracing

454.calculix C/Fortran Structural Mechanics

459.GemsFDTD Fortran Computational Electromagnetics

465.tonto Fortran Quantum Chemistry

470.lbm C Fluid Dynamics

481.wrf C/Fortran Weather Prediction

482.sphinx3 C Speech recognition
