cs433 spring 2001 introduction laxmikant kale. 2 course objectives and outline you will learn about:...

CS433, Spring 2001: Introduction

Laxmikant Kale

2

Course objectives and outline

• You will learn about:

– Parallel programming models

• Emphasis on 3: message passing, shared memory, and shared objects

• Ongoing evaluation and comparison of models

– Parallel application classes

– Parallel architectures

• Message passing support, routing, interconnection networks

• Cache-coherent scalable shared memory, synchronization

• Relaxed consistency models

• Novel architectures: Tera, Blue Gene, Processors-in-memory

– Commonly needed parallel algorithms/operations

– Performance analysis of parallel applications

– Parallel application case studies

3

Project and homework

• Significant (in both effort and grade percentage) course project

– groups of 5 students

• Homework/machine problems:

– weekly (sometimes biweekly)

• Parallel machines:

– NCSA Origin 2000, PC/SUN clusters

4

Resources

• Much of the course will be run via the web

– Lecture slides and assignments will be available on the course web page

• http://www-courses.cs.uiuc.edu/~cs433

– Most of the reading material (papers, manuals) will be on the web

– Projects will coordinate and submit information on the web

• Web pages for individual projects will be linked to the course web page

– Newsgroup: uiuc.class.cs433

• You are expected to read the newsgroup and web pages regularly

5

Advent of parallel computing

• “Parallel computing is necessary to increase speeds”

– cry of the ’70s

– yet processors kept pace with Moore’s law:

• Doubling speeds every 18 months

• Now, finally, the time is ripe

– uniprocessors are commodities (and processor speeds show signs of slowing down)

– Highly economical to build parallel machines

6

Why parallel computing

• It is the only way to increase speed beyond uniprocessors

– Except, of course, waiting for uniprocessors to become faster!

– Several applications require orders of magnitude higher performance than feasible on uniprocessors

• Cost effectiveness:

– older argument

– in 1985, a supercomputer cost 2000 times more than a desktop, yet performed only 400 times faster.

– So: combine microcomputers to get speed at lower cost

– Incremental scalability:

• can get in-between performance points with 20, 50, 100, … processors

– But:

• You may get speedup lower than 400 on 2000 processors!

• Microcomputers became faster, effectively killing supercomputers
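
As a rough worked example of the cost argument above, the sketch below computes the desktop's performance-per-cost advantage and the parallel efficiency at which 2000 desktops would only match the supercomputer. It uses just the two 1985 ratios quoted on the slide. The constant and function names are illustrative, and it assumes that cost scales linearly with the number of desktops.

```python
# Ratios quoted on the slide (1985): supercomputer relative to one desktop.
COST_RATIO = 2000    # supercomputer cost / desktop cost
SPEED_RATIO = 400    # supercomputer speed / desktop speed


def perf_per_cost_advantage(cost_ratio, speed_ratio):
    """How many times more performance per dollar a single desktop delivers."""
    return cost_ratio / speed_ratio


def break_even_efficiency(cost_ratio, speed_ratio):
    """Parallel efficiency at which `cost_ratio` desktops (same total cost)
    deliver exactly the supercomputer's speed."""
    return speed_ratio / cost_ratio


advantage = perf_per_cost_advantage(COST_RATIO, SPEED_RATIO)
efficiency = break_even_efficiency(COST_RATIO, SPEED_RATIO)

print(f"desktop performance-per-cost advantage: {advantage:.1f}x")
print(f"break-even parallel efficiency on 2000 desktops: {efficiency:.0%}")
```

Under these assumptions, the 2000-desktop machine wins only if the application runs at better than 20% parallel efficiency, which is the point of the "speedup lower than 400 on 2000 processors" caveat.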

7

Technology Trends

The natural building block for multiprocessors is now also about the fastest!

[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]

8

General Technology Trends

• Microprocessor performance increases 50% - 100% per year

• Transistor count doubles every 3 years

• DRAM size quadruples every 3 years

• Huge investment per generation is carried by huge commodity market

• Not that single-processor performance is plateauing, but that parallelism is a natural way to improve it.
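
To put the trends above on a common footing, the sketch below converts "grows by a factor every N years" statements into equivalent annual rates. It assumes smooth exponential growth, which is an idealization, and the helper name `annual_rate` is hypothetical.

```python
def annual_rate(factor, years):
    """Annual growth rate equivalent to growing by `factor` over `years` years,
    assuming constant exponential growth."""
    return factor ** (1.0 / years) - 1.0


transistor_rate = annual_rate(2, 3)   # transistor count doubles every 3 years
dram_rate = annual_rate(4, 3)         # DRAM size quadruples every 3 years

print(f"transistor count: {transistor_rate:.0%} per year")
print(f"DRAM size:        {dram_rate:.0%} per year")
```

Under this assumption, transistor count grows by roughly 26% per year and DRAM size by roughly 59% per year, which can be set against the 50% - 100% per year quoted for microprocessor performance.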

[Figure: integer and FP performance of microprocessors (scale 0 to 180), 1987-1992: Sun 4/260, MIPS M/120, IBM RS6000/540, MIPS M2000, HP 9000/750, DEC Alpha]

9

Technology: A Closer Look

• Basic advance is decreasing feature size (λ)

– Circuits become either faster or lower in power

• Die size is growing too

– Clock rate improves roughly proportional to improvement in λ

– Number of transistors improves like λ² (or faster)

• Performance > 100x per decade; clock rate 10x, rest transistor count

• How to use more transistors?

– Parallelism in processing

• multiple operations per cycle reduces CPI

– Locality in data access

• avoids latency and reduces CPI

• also improves processor utilization

– Both need resources, so tradeoff

• Fundamental issue is resource distribution, as in uniprocessors

[Figure: chip resources divided among processor (Proc), cache ($), and interconnect]

10

Clock Frequency Growth Rate

• 30% per year

[Figure: clock rate (MHz, log scale, 0.1 to 1,000) versus year, 1970-2005: i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000]

11

Transistor Count Growth Rate

• 100 million transistors on chip by the early 2000s

• Transistor count grows much faster than clock rate

– 40% per year, order of magnitude more contribution in 2 decades
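
A quick check of how these annual rates compound, assuming constant growth (an idealization) and using the 30% per year clock-rate figure from the preceding slide. The helper name `growth_over` is hypothetical.

```python
def growth_over(annual_rate, years):
    """Total growth factor after `years` years at a constant `annual_rate`."""
    return (1.0 + annual_rate) ** years


CLOCK_RATE = 0.30        # clock frequency growth per year (from the slides)
TRANSISTOR_RATE = 0.40   # transistor count growth per year (from the slides)

for years in (10, 20):
    clock = growth_over(CLOCK_RATE, years)
    transistors = growth_over(TRANSISTOR_RATE, years)
    print(f"{years} years: clock x{clock:.0f}, transistors x{transistors:.0f}, "
          f"ratio {transistors / clock:.1f}")
```

At exactly these rates, clock rate grows a little over 10x per decade, consistent with the "clock rate 10x" figure quoted earlier. The gap between the two curves widens every year, so the size of the transistor-count contribution depends strongly on the rates assumed.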

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]

12

Similar Story for Storage

• Divergence between memory capacity and speed

– Capacity increased by 1000x from 1980-95, speed only 2x

– Gigabit DRAM by c. 2000, but gap with processor speed greater

• Larger memories are slower, while processors get faster

– Need to transfer more data in parallel

– Need deeper cache hierarchies

– How to organize caches?

• Parallelism increases effective size of each level of hierarchy, without increasing access time

• Parallelism and locality within memory systems too

– New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface

– Buffer caches most recently accessed data

• Disks too: Parallel disks plus caching

13

Architectural Trends

• Architecture translates technology’s gifts to performance and capability

• Resolves the tradeoff between parallelism and locality

– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect

– Tradeoffs may change with scale and technology advances

• Understanding microprocessor architectural trends

– Helps build intuition about design issues for parallel machines

– Shows fundamental role of parallelism even in “sequential” computers

• Four generations of architectural history:

– Vacuum tube, transistor, IC, VLSI

– Here focus only on VLSI generation

• Greatest delineation in VLSI has been in type of parallelism exploited

14

Architectural Trends

• Greatest trend in VLSI generation is increase in parallelism

– Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit

• slows after 32 bit

• adoption of 64-bit now under way, 128-bit far (not performance issue)

• great inflection point when 32-bit micro and cache fit on a chip

– Mid 80s to mid 90s: instruction level parallelism

• pipelining and simple instruction sets, + compiler advances (RISC)

• on-chip caches and functional units => superscalar execution

• greater sophistication: out of order execution, speculation, prediction

– to deal with control transfer and latency problems

15

Economics

• Commodity microprocessors not only fast but CHEAP

• Development cost is tens of millions of dollars ($5-100 million typical)

• BUT, many more are sold compared to supercomputers

– Crucial to take advantage of the investment, and use the commodity building block

– Exotic parallel architectures no more than special-purpose

• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors

• Standardization by Intel makes small, bus-based SMPs commodity

• Desktop: few smaller processors versus one larger one?

– Multiprocessor on a chip

16

What to Expect?

• Parallel machine classes:

– Cost and usage define a class! Architecture of a class may change.

– Desktops, engineering workstations, database/web servers, supercomputers

• Commodity (home/office) desktop:

– less than $10,000

– possible to provide 10-50 processors for that price!

– Driver applications:

• games, video/signal processing

• possibly “peripheral” AI: speech recognition, natural language understanding (?), smart spaces and agents

• New applications?

17

Engineering workstations

• Price: less than $100,000 (used to be):

– new acceptable price level may be $50,000

– 100+ processors, large memory

– Driver applications:

• CAD (computer-aided design) of various sorts

• VLSI

• Structural and mechanical simulations…

• Etc. (many specialized applications)

18

Commercial Servers

• Price range: variable ($10,000 to several hundred thousand dollars)

– defining characteristic: usage

– Database servers, decision support (MIS), web servers, e-commerce

• High availability, fault tolerance are main criteria

• Trends to watch out for:

– Likely emergence of specialized architectures/systems

• E.g. Oracle’s “No Native OS” approach

• Currently dominated by database servers, and TPC benchmarks

– TPC: Transaction Processing Performance Council; its benchmarks report transaction throughput

– But this may change to data mining and application servers, with corresponding impact on architecture.

19

Supercomputers

• “Definition”: expensive system?!

– Used to be defined by architecture (vector processors, ..)

– More than a million US dollars?

– Thousands of processors

• Driving applications

– Grand challenges in science and engineering:

– Global weather modeling and forecast

– Rational Drug design / molecular simulations

– Processing of genetic (genome) information

– Rocket simulation

– Airplane design (wings and fluid flow..)

– Operations research?? Not recognized yet

– Other non-traditional applications?

20

Consider Scientific Supercomputing

• Proving ground and driver for innovative architecture and techniques

– Market smaller relative to commercial as MPs become mainstream

– Dominated by vector machines starting in ’70s

– Microprocessors have made huge gains in floating-point performance

• high clock rates

• pipelined floating point units (e.g., multiply-add every cycle)

• instruction-level parallelism

• effective use of caches (e.g., automatic blocking)

– Plus economics

• Large-scale multiprocessors replace vector supercomputers

– Well under way already

21

Scientific Computing Demand

22

Engineering Computing Demand

• Large parallel machines a mainstay in many industries

– Petroleum (reservoir analysis)

– Automotive (crash simulation, drag analysis, combustion efficiency)

– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)

– Computer-aided design

– Pharmaceuticals (molecular modeling)

– Visualization

• in all of the above

• entertainment (films like Toy Story)

• architecture (walk-throughs and rendering)

– Financial modeling (yield and derivative analysis)

– etc.

23

Applications: Speech and Image Processing

[Figure: processing requirement (log scale, 1 MIPS to 10 GIPS) versus year, 1980-1995, for speech and image applications: sub-band speech coding; 200-word isolated speech recognition; speaker verification; CELP speech coding; ISDN-CD stereo receiver; telephone number recognition; 1,000-word continuous speech recognition; CIF video; 5,000-word continuous speech recognition; HDTV receiver]

• Also CAD, Databases, . . .

• 100 processors get you 10 years ahead of uniprocessor performance; 1000 get you 20!
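
The "processors buy you years" rule of thumb can be sketched as follows, assuming an ideal P-fold speedup on P processors and the 50% - 100% per year uniprocessor performance growth quoted earlier in these slides. Both are idealizations, and the helper name `years_equivalent` is hypothetical.

```python
import math


def years_equivalent(processors, annual_growth):
    """Years of uniprocessor growth at `annual_growth` needed to equal an
    ideal `processors`-fold speedup."""
    return math.log(processors) / math.log(1.0 + annual_growth)


for p in (100, 1000):
    at_fast_growth = years_equivalent(p, 1.00)   # 100% per year
    at_slow_growth = years_equivalent(p, 0.50)   # 50% per year
    print(f"{p} processors: about {at_fast_growth:.0f} to "
          f"{at_slow_growth:.0f} years of uniprocessor progress")
```

With these assumptions, the 10-year and 20-year figures correspond roughly to the slower end of the growth range; real applications with less-than-ideal speedup gain fewer years.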

24

Learning Curve for Parallel Applications

• AMBER molecular dynamics simulation program

• Starting point was vector code for Cray-1

• 145 MFLOPS on Cray C90, 406 for final version on 128-processor Paragon, 891 on 128-processor Cray T3D

25

Raw Uniprocessor Performance: LINPACK

[Figure: LINPACK performance (MFLOPS, log scale, 1 to 10,000) versus year, 1975-2000. Series: CRAY n = 100, CRAY n = 1,000, Micro n = 100, Micro n = 1,000. Cray machines: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94. Microprocessor machines: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, DEC Alpha, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200]

26

500 Fastest Computers

[Figure: number of systems (scale 0 to 350) among the 500 fastest computers by class (PVP, MPP, SMP), 11/93 through 11/96]
