CSCE 930 Advanced Computer Architecture
Lecture 1
Evaluate Computer Architectures
Dr. Jun Wang
Computer Architecture Trends
[Figure 1.1 (H&P): growth in microprocessor performance, 35% per year]
Technology Trends
• Smaller feature sizes – higher speed, density
[Figure: minimum feature size vs. year, 1972–2000, from 10 um (4004) down to 130 nm (Pentium 4)]
• Density is increased by 77 times
Technology Trends
• Larger chips
  – Trend is toward more RAM, less logic per chip
  – Historically 2x per generation; leveling off?
  – McKinley has large on-chip caches
  => larger wafers to reduce fabrication costs
[Figure: die sizes from the 4004 through the 8086, 286, 486, Pentium, Pentium 4, and McKinley (IA-64)]
Moore’s Law
• Number of transistors doubles every 18 months (amended to 24 months)
• Combination of both greater density and larger chips
[Figure: thousands of transistors (log scale, 1 to 10,000) vs. year, 1972–2000, for the 4004 through the Pentium 4]
Tech. Trends, contd.
• More, faster, cheaper transistors have fed an application demand for higher performance
  – 1970s – serial, 1-bit integer microprocessors
  – 1980s – pipelined 32-bit RISC
    • ISA simplicity allows processor on chip
  – 1990s – large, superscalar processors, even for CISC
  – 2000s – multiprocessors on a chip
...
Pipelining and Branch Prediction
• Two basic ways of increasing performance
• Pipelining:
[Figure: five-stage pipeline IF | ID | EX | ME | WB with latches between stages, driven by a common clock]
• Branch Prediction
  – Speculate on branch outcome to avoid waiting
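The branch-prediction bullet can be made concrete with a small software model. This is a sketch (not from the lecture) of a 2-bit saturating-counter predictor, one classic hardware scheme for speculating on branch outcomes; the branch address 0x400 and the outcome pattern are invented for illustration.

```python
class TwoBitPredictor:
    """Per-branch 2-bit counters: states 0,1 predict not-taken; 2,3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch PC -> counter state in [0, 3]

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2  # True means "predict taken"

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        # Saturate at 0 and 3 so one anomalous outcome does not flip
        # a strongly biased branch.
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

predictor = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8  # a loop branch with one exit
hits = 0
for taken in outcomes:
    if predictor.predict(0x400) == taken:
        hits += 1
    predictor.update(0x400, taken)
# The single loop-exit mispredict costs only one wrong guess
# afterwards, because the counter only drops from 3 to 2.
```

The 2-bit (rather than 1-bit) state is the design point: a single loop exit does not cause two mispredictions on re-entry.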
Tech. Trend: Memory Sizes
• Memories have grown very dense
  – Feeding application demand for large, complex software
[Figure: bits per DRAM chip vs. year, 1980–2000: 64 Kb, 256 Kb, 1 Mb, 4 Mb, 16 Mb, 64 Mb, 256 Mb]
Tech. Trend: Memory Speeds
• Main memory speeds have not kept up with processor speeds
[Figure: memory and processor cycle times vs. year, 1980–2000, on a 20 ns to 260 ns scale; the memory-to-processor cycle-time ratio grows from 1.3:1 in 1980 to 90+:1 in 2000]
Memory Hierarchies
• Gap between processor and memory performance has led to widespread use of memory hierarchies
  – 1960s: no caches, no virtual memory
  – 1970s: shared I- & D-cache, 32-bit virtual memory
  – 1980s: split I- and D-caches
  – 1990s: two-level caches, 64-bit virtual memory
  – 2000s: multi-level caches, both on- and off-chip
Memory Hierarchies
[Figure: the hierarchy from small/fast to large/slow: registers and L1 cache inside the processor; L2 cache, L3 cache, and main memory in the memory system]
I/O: a Key System Component
• I/O has evolved into a major distinguishing feature of computer systems
  – 1960s: disk, tape, punch cards, tty; batch processing
  – 1970s: character-oriented displays
  – 1980s: video displays, audio, increasing disk sizes, beginning networking
  – 1990s: 3D graphics; networking a fundamental element; high-quality audio
  – 2000s: real-time video, immersion…
I/O Systems
• A hierarchy that divides bandwidth
[Figure: processor and DRAM on a local bus; a high-speed I/O bus with a LAN interface, frame buffer/monitor, and expansion slots; a slow-speed I/O bus behind a controller, serving hard drives, CD-ROM, and floppy]
• Data rates
  – Memory: 100 MHz, 8 bytes => 800 MB/s (peak)
  – PCI: 33 MHz, 4 bytes wide => 132 MB/s (peak)
  – SCSI: “Ultra2” (40 MHz), “Wide” (2 bytes) => 80 MB/s (peak)
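The peak data rates above are simply transfer clock rate times bus width; a quick sanity check of the slide's numbers:

```python
def peak_mb_per_s(clock_mhz, bytes_wide):
    # Peak bandwidth = transfers per microsecond (MHz) x bytes per transfer.
    return clock_mhz * bytes_wide

assert peak_mb_per_s(100, 8) == 800   # memory bus
assert peak_mb_per_s(33, 4) == 132    # PCI
assert peak_mb_per_s(40, 2) == 80     # Ultra2 Wide SCSI
```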
Multiprocessors
• Multiprocessors have been available for decades…
  – 1960s: small MPs
  – 1970s: small MPs
    • Dream of automatic parallelization
  – 1980s: small MPs; emergence of servers
    • Dream of automatic parallelization
  – 1990s: expanding MPs
    • Very large MPPs failed
    • Dream of automatic parallelization fading
  – 2000s: wide-spread MPs; on-chip multithreading
    • Many applications have independent threads
    • Programmers write applications to be parallel in the first place
[Figure: processors (P), each with a cache (C) and memory (M), connected by an interconnection network]
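The last point, programmers writing applications to be parallel in the first place, can be sketched with independent per-chunk tasks. The checksum function and chunk sizes here are invented for illustration; any shared-nothing decomposition works the same way.

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    # Independent per-chunk work: no shared state, so chunks can run
    # on separate processors with no coordination beyond the final join.
    return sum(chunk) % 65521

data = list(range(100_000))
chunks = [data[i:i + 25_000] for i in range(0, len(data), 25_000)]

# Explicitly parallel: one task per chunk, results combined afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(checksum, chunks))

serial = [checksum(c) for c in chunks]
assert partials == serial  # same answer, regardless of scheduling
```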
Evaluating Computer Architectures
Computation Science
• Computation is synthetic
  – Many of the phenomena in the computing field are created by humans rather than occurring naturally in the physical world
  – Very different from the natural sciences
    » When one discovers a fact about nature, it is a contribution, no matter how small
    » Creating something new alone does not establish a contribution
  – Anyone can create something new in a synthetic field
  – Rather, one must show that the creation is better
What Does “Better” Mean?
• “Better” can mean many things
  – Solves a problem in less time (faster)
  – Solves a larger class of problems (more powerful)
  – Uses resources more efficiently (cheaper)
  – Is less prone to errors (more reliable)
  – Is easier to manage/program (lower human cost)
Amdahl's Law
• Speedup due to enhancement E:

  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

• Find how much speedup comes from some enhancement E
• Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
• Defines the speedup that can be gained by using a special feature
Amdahl’s Law

  ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
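The overall-speedup formula translates directly into code. A minimal sketch (the function name and example numbers are mine, not from the slides):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the old execution time
    is accelerated by a factor S, per the formula above."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Enhancing 50% of the task by 10x yields well under 2x overall:
s = amdahl_speedup(0.5, 10)   # 1 / (0.5 + 0.05) = 1.818...
```

The unenhanced fraction dominates: even with speedup_enhanced approaching infinity, the overall speedup is capped at 1 / (1 - fraction_enhanced).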
• The “better” property is not simply an observation
  – Rather, the research will postulate that a new idea
    • an architecture, algorithm, protocol, data structure, methodology, language, optimization or model, etc.
  – will lead to a “better” result
  – Making the connection between the idea and the improvement is as important as quantifying the size of the improvement
• The contribution is the idea, and is generally a component of a larger computational system.
How to Evaluate Architecture Ideas
• Measuring/observing/analyzing real systems
  – Accurate results
  – Need a working system
    » Too expensive to evaluate architecture/system ideas
• Analytic models
  – Fast & easy analysis of relations
  – T_program = NumOfInst x (T_cpu + T_m x (1 - CacheHit))
  – Allow extrapolation to ridiculous parameters, e.g. thousands of processors
  – Sometimes infeasible to obtain accuracy (e.g. modeling caches)
  – To obtain reasonable accuracy, the models may become very complex (e.g. modeling network contention)
  – Queuing theory is a commonly used technique
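The analytic model above can be evaluated directly. A sketch with invented illustrative numbers (1 ns of CPU time per instruction, 100 ns memory penalty, 98% cache hit rate):

```python
def t_program(num_inst, t_cpu, t_mem, cache_hit):
    # The slide's model: every instruction costs t_cpu, plus a memory
    # penalty t_mem weighted by the miss ratio (1 - cache_hit).
    return num_inst * (t_cpu + t_mem * (1.0 - cache_hit))

# Illustrative numbers: 1e9 instructions, 1 ns CPU time per
# instruction, 100 ns memory penalty, 98% cache hit rate.
runtime_ns = t_program(1e9, 1.0, 100.0, 0.98)
# Even a 2% miss ratio triples the runtime (1 ns + 2 ns penalty
# per instruction), which is the memory-wall point made earlier.
```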
• Simulation
  – The most popular method in computer architecture and systems research
  – Mimics the architecture/system using software
  – Very flexible: nearly unlimited evaluation
  – Prototyping of non-existing machines possible
  – Evaluation of design options (design space exploration) cheap & flexible
  – Requires some sort of validation
  – Can be VERY slow
• Tradeoff between accuracy and computational intensity
  – Low level of abstraction => slow (e.g. simulating at the level of gates)
  – High level of abstraction => fast (e.g. only simulating processor, cache and memory components)
• The tradeoff may be intensified when modeling parallel architectures, as multiple processors need to be simulated
Three Simulation Techniques
• Profile-based, static modeling
  – Simplest and least costly
  – Uses hardware counters on the chip or instrumented execution (such as Beowulf Linux cluster pgprof, SGI perfex, and Alpha ATOM)
• Trace-driven
  – A more sophisticated technique
  – How it works (ex.: modeling memory system performance):
    • Collect traces generated by ATOM
    • Trace format: instruction address executed, data address accessed
    • Build the memory hierarchy model
    • Feed the trace into the simulation model and analyze results
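The trace-driven steps above can be sketched as a toy simulator: a hand-made address trace stands in for ATOM output, and a tiny direct-mapped cache stands in for the memory hierarchy model. All sizes and addresses below are invented for illustration.

```python
# Toy trace-driven simulation: feed address records from a trace
# into a direct-mapped cache model and report the hit ratio.

LINE_SIZE = 64       # bytes per cache line
NUM_LINES = 4        # deliberately tiny so the trace shows misses

def simulate(trace):
    tags = [None] * NUM_LINES           # one tag per direct-mapped line
    hits = 0
    for addr in trace:
        block = addr // LINE_SIZE       # which memory block
        index = block % NUM_LINES       # which cache line it maps to
        if tags[index] == block:
            hits += 1                   # hit: block already resident
        else:
            tags[index] = block         # miss: fill the line
    return hits / len(trace)

# Made-up data-address trace: a small loop touching two arrays,
# plus one conflicting access (512 maps to the same line as 0).
trace = [0, 8, 64, 72, 0, 8, 64, 72, 512, 0]
hit_ratio = simulate(trace)
```

The same loop structure scales to real traces; only the model (associativity, levels, replacement policy) grows.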
Using pgprof
1. Compile: pgcc -Mprof=func prg.cc
2. Run the code: produces a profile data file called pgprof.out
3. View the execution profile: pgprof pgprof.out
Using Perfex
• Usage:
    perfex [-e num] [-y] program [program args]
  – -e num: count only event type num
  – -y: generate a “cost report”
• Example
    perfex -e 41 -e 13 -y a.out

  EVENT #  Event                       Events Counted
  41       Floating point OPs retired  25292884493
  13       L2 cache lines loaded       223490870

  Statistics:
  MFLOPS                      29.175907
  Main memory/L2 bandwidth    8.249655 MB/s
• Execution-driven
  – The most accurate and most costly
  – Trace-driven simulation cannot capture the interaction between the memory system and the processor
  – Detailed simulation of the memory system and the processor pipeline is done simultaneously, by actually executing the program on top of a simulation framework such as Simics, SimOS, or SimpleScalar
Measuring by Means of Benchmarks
• Micro-benchmarks (e.g. instruction latencies, file system throughput)
• Application benchmarks: general system behavior (e.g. SPEC2000 or SPLASH-2)
• Only limited evaluation possible (e.g. limited system support for measurement)
• The machine must be available
• Benchmark suites: collections of kernels and real benchmark programs, lessening the weakness of any one benchmark by the presence of others
Summarize Results
• Weighted Arithmetic Mean Execution Time: sum_i (W_i * T_i)
  – Sum the products of weighting factors and execution times, reflecting the individual frequency of each workload
  – W_i = 1 / (Time_i * sum_{j=1..n} (1 / Time_j))
• Geometric Mean Execution Time: (prod_i (T_i / N_i))^(1/n)
  – Normalize execution times to a reference machine and take the average of normalized execution times
  – Used by SPEC
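Both summary statistics are easy to compute; a sketch (function names are mine). The equalizing weights from the formula above make every workload contribute the same amount to the weighted mean:

```python
import math

def weighted_arith_mean(times, weights):
    # Sum of W_i * T_i, as in the slide.
    return sum(w * t for w, t in zip(weights, times))

def equalizing_weights(times):
    # The slide's W_i = 1 / (Time_i * sum_j(1/Time_j)): weights chosen
    # so each workload contributes equally to the weighted mean.
    total_inv = sum(1.0 / t for t in times)
    return [1.0 / (t * total_inv) for t in times]

def geometric_mean(ratios):
    # nth root of the product of normalized execution times.
    return math.prod(ratios) ** (1.0 / len(ratios))

times = [10.0, 0.1]                  # two programs, wildly different
w = equalizing_weights(times)
wam = weighted_arith_mean(times, w)  # each program contributes equally
gm = geometric_mean([10.0, 0.1])     # normalized ratios cancel to 1.0
```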
A Report Example (P&H figure 1.17)

                   Normalized to A       Normalized to B       Normalized to C
                   A     B      C        A      B     C        A      B     C
  Program P1       1.0   10.0   20.0     0.1    1.0   2.0      0.05   0.5   1.0
  Program P2       1.0   0.1    0.02     10.0   1.0   0.2      50.0   5.0   1.0
  Arithmetic mean  1.0   5.05   10.01    5.05   1.0   1.1      25.03  2.75  1.0
  Geometric mean   1.0   1.0    0.63     1.0    1.0   0.63     1.58   1.58  1.0
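The table's entries can be reproduced from the raw execution times (P1: 1, 10, 20 and P2: 1, 0.1, 0.02 on machines A, B, C). A sketch showing why the geometric mean gives consistent machine ratios under any normalization, while the arithmetic mean does not:

```python
import math

times = {"A": [1.0, 1.0], "B": [10.0, 0.1], "C": [20.0, 0.02]}  # P1, P2

def normalized_means(ref):
    out = {}
    for machine, (p1, p2) in times.items():
        r1, r2 = p1 / times[ref][0], p2 / times[ref][1]
        out[machine] = (round((r1 + r2) / 2, 2),        # arithmetic mean
                        round(math.sqrt(r1 * r2), 2))   # geometric mean
    return out

to_a = normalized_means("A")
# The arithmetic mean changes ranking with the reference machine,
# but C's geometric mean is 0.63 whether we normalize to A or to B.
```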