CS433 Spring 2001: Introduction
Laxmikant Kale
2
Course objectives and outline
• You will learn about:
– Parallel programming models
• Emphasis on 3: message passing, shared memory, and shared objects
• Ongoing evaluation and comparison of models
– Parallel application classes
– Parallel architectures
• Message passing support, routing, interconnection networks
• Cache-coherent scalable shared memory, synchronization
• Relaxed consistency models
• Novel architectures: Tera, Blue Gene, Processors-in-memory
– Commonly needed parallel algorithms/operations
– Performance analysis of parallel applications
– Parallel application case studies
3
Project and homeworks
• Significant (effort and grade percentage) course project
– groups of 5 students
• Homeworks/machine problems:
– weekly (sometimes biweekly)
• Parallel machines:
– NCSA Origin 2000, PC/SUN clusters
4
Resources
• Much of the course will be run via the web
– Lecture slides and assignments will be available on the course web page
• http://www-courses.cs.uiuc.edu/~cs433
– Most of the reading material (papers, manuals) will be on the web
– Projects will coordinate and submit information on the web
• Web pages for individual projects will be linked to the course web page
– Newsgroup: uiuc.class.cs433
• You are expected to read the newsgroup and web pages regularly
5
Advent of parallel computing
• “Parallel computing is necessary to increase speeds” – cry of the ‘70s
– processors kept pace with Moore’s law:
• Doubling speeds every 18 months
• Now, finally, the time is ripe
– uniprocessors are commodities (and processor speeds show signs of slowing down)
– Highly economical to build parallel machines
6
Why parallel computing
• It is the only way to increase speed beyond uniprocessors
– Except, of course, waiting for uniprocessors to become faster!
– Several applications require orders of magnitude higher performance than feasible on uniprocessors
• Cost effectiveness:
– older argument
– in 1985, a supercomputer cost 2000 times more than a desktop, yet performed only 400 times faster.
– So: combine microcomputers to get speed at lower costs
– Incremental scalability:
• can get in-between performance points with 20, 50, 100,… processors
– But:
• You may get speedup lower than 400 on 2000 processors!
• Microcomputers became faster, effectively killing supercomputers
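The cost argument above can be checked with a little arithmetic. The sketch below uses the 1985 figures from the slide (2000 times the cost, 400 times the speed). The Amdahl's-law model and the serial fractions used to illustrate the "speedup lower than 400" caveat are assumptions made for illustration and do not come from the slides.

```python
def amdahl_speedup(processors: int, serial_fraction: float) -> float:
    """Speedup predicted by Amdahl's law (an assumed model, not from the slides)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Figures quoted on the slide (1985), relative to one desktop machine.
SUPER_COST = 2000.0   # a supercomputer costs 2000x a desktop
SUPER_SPEED = 400.0   # but runs only 400x faster

# Performance per unit cost of the supercomputer, relative to the desktop.
super_perf_per_cost = SUPER_SPEED / SUPER_COST

# Spending the same money on 2000 desktops only wins if the parallel
# speedup exceeds 400. The serial fractions below are hypothetical.
for serial_fraction in (0.0, 0.001, 0.01):
    speedup = amdahl_speedup(2000, serial_fraction)
    verdict = "beats" if speedup > SUPER_SPEED else "loses to"
    print(f"serial fraction {serial_fraction}: speedup {speedup:.0f}, "
          f"{verdict} the supercomputer")
```

Under this assumed model, even a 1% serial fraction is enough to push the speedup on 2000 processors below 400, which is the caveat the slide raises.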
7
Technology Trends
The natural building block for multiprocessors is now also about the fastest!
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]
8
General Technology Trends
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
• Not that single-processor performance is plateauing, but that parallelism is a natural way to improve it.
[Figure: integer and floating-point (FP) performance (0 to 180) by year, 1987 to 1992, for Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha]
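The trends on this slide mix two ways of stating growth: a percentage per year and a multiple every few years. A minimal sketch for converting between the two is shown below. The function names are hypothetical helpers introduced for this example; only the growth figures come from the slide.

```python
import math


def annual_rate(factor: float, years: float) -> float:
    """Annual growth rate equivalent to growing by `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0


def doubling_time(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)


# Slide figures: transistor count doubles and DRAM size quadruples every 3 years.
print(f"transistors, 2x per 3 years: {annual_rate(2, 3):.0%} per year")
print(f"DRAM size, 4x per 3 years:   {annual_rate(4, 3):.0%} per year")

# Slide figure: microprocessor performance grows 50% to 100% per year.
for rate in (0.5, 1.0):
    print(f"performance at {rate:.0%} per year doubles every "
          f"{doubling_time(rate):.2f} years")
```

Putting all figures on a per-year basis makes it easier to compare the quoted performance growth with the underlying transistor and DRAM growth.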
9
Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
– Circuits become either faster or lower in power
• Die size is growing too
– Clock rate improves roughly proportional to improvement in λ
– Number of transistors improves like λ² (or faster)
• Performance > 100x per decade; clock rate 10x, rest transistor count
• How to use more transistors?
– Parallelism in processing
• multiple operations per cycle reduces CPI
– Locality in data access
• avoids latency and reduces CPI
• also improves processor utilization
– Both need resources, so tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
[Figure: processor (Proc) and cache ($) attached to an interconnect]
10
Clock Frequency Growth Rate
• 30% per year
[Figure: clock rate in MHz (log scale, 0.1 to 1,000) versus year, 1970 to 2005; data points include i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, and R10000]
11
Transistor Count Growth Rate
• 100 million transistors on chip by early 2000s
• Transistor count grows much faster than clock rate
– 40% per year, order of magnitude more contribution in 2 decades
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005; data points include i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000]
12
Similar Story for Storage
• Divergence between memory capacity and speed
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of hierarchy, without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface
– Buffer caches most recently accessed data
• Disks too: Parallel disks plus caching
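To see why a widening processor-memory gap calls for deeper cache hierarchies, one can estimate the average memory access time. The sketch below uses the standard textbook formula (hit time plus miss rate times the cost of going one level further out). The formula, the function name, and all latencies and miss rates are assumptions chosen for illustration; none of them appear in the slides.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    `levels` is a list of (hit_time, miss_rate) pairs ordered from the
    cache closest to the processor outward; `memory_latency` is the cost
    of an access that misses in every level. All values are hypothetical.
    """
    time = memory_latency
    # Work inward from main memory: each level adds its own hit time and
    # pays the cost of the next level only on a miss.
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


# One cache level in front of a 100-cycle memory.
one_level = average_access_time([(1, 0.05)], memory_latency=100)

# Adding a second, larger and slower cache level behind the first.
two_levels = average_access_time([(1, 0.05), (10, 0.2)], memory_latency=100)

print(f"one level:  {one_level:.1f} cycles per access")
print(f"two levels: {two_levels:.1f} cycles per access")
```

With these made-up numbers, the second level cuts the average access time by more than half, even though each level by itself is slower than the one in front of it.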
13
Architectural Trends
• Architecture translates technology’s gifts to performance and capability
• Resolves the tradeoff between parallelism and locality
– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
– Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
– Helps build intuition about design issues for parallel machines
– Shows fundamental role of parallelism even in “sequential” computers
• Four generations of architectural history:
– Vacuum tube, transistor, IC, VLSI
– Here focus only on VLSI generation
• Greatest delineation in VLSI has been in type of parallelism exploited
14
Architectural Trends
• Greatest trend in VLSI generation is increase in parallelism
– Up to 1985: bit level parallelism: 4-bit -> 8-bit -> 16-bit
• slows after 32-bit
• adoption of 64-bit now under way, 128-bit far (not performance issue)
• great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction level parallelism
• pipelining and simple instruction sets, + compiler advances (RISC)
• on-chip caches and functional units => superscalar execution
• greater sophistication: out of order execution, speculation, prediction
– to deal with control transfer and latency problems
15
Economics
• Commodity microprocessors not only fast but CHEAP
• Development cost is tens of millions of dollars (5-100 typical)
• BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the commodity building block
– Exotic parallel architectures no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
– Multiprocessor on a chip
16
What to Expect?
• Parallel Machine classes:
– Cost and usage define a class! Architecture of a class may change.
– Desktops, engineering workstations, database/web servers, supercomputers
• Commodity (home/office) desktop:
– less than $10,000
– possible to provide 10-50 processors for that price!
– Driver applications:
• games, video/signal processing,
• possibly “peripheral” AI: speech recognition, natural language understanding (?), smart spaces and agents
• New applications?
17
Engineering workstations
• Price: less than $100,000 (used to be):
– new acceptable price level may be $50,000
– 100+ processors, large memory,
– Driver applications:
• CAD (Computer aided design) of various sorts
• VLSI
• Structural and mechanical simulations…
• Etc. (many specialized applications)
18
Commercial Servers
• Price range: variable ($10,000 - several hundreds of thousands)
– defining characteristic: usage
– Database servers, decision support (MIS), web servers, e-commerce
• High availability, fault tolerance are main criteria
• Trends to watch out for:
– Likely emergence of specialized architectures/systems
• E.g. Oracle’s “No Native OS” approach
• Currently dominated by database servers, and TPC benchmarks
– TPC: Transaction Processing Performance Council; its benchmarks measure transaction throughput
– But this may change to data mining and application servers, with corresponding impact on architecture.
19
Supercomputers
• “Definition”: expensive system?!
– Used to be defined by architecture (vector processors, ..)
– More than a million US dollars?
– Thousands of processors
• Driving applications
– Grand challenges in science and engineering:
– Global weather modeling and forecast
– Rational Drug design / molecular simulations
– Processing of genetic (genome) information
– Rocket simulation
– Airplane design (wings and fluid flow..)
– Operations research?? Not recognized yet
– Other non-traditional applications?
20
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point performance
• high clock rates
• pipelined floating point units (e.g., multiply-add every cycle)
• instruction-level parallelism
• effective use of caches (e.g., automatic blocking)
– Plus economics
• Large-scale multiprocessors replace vector supercomputers
– Well under way already
21
Scientific Computing Demand
22
Engineering Computing Demand
• Large parallel machines a mainstay in many industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
• in all of the above
• entertainment (films like Toy Story)
• architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.
23
Applications: Speech and Image Processing
[Figure: processing requirements (1 MIPS to 10 GIPS) of speech and image applications versus year, 1980 to 1995: sub-band speech coding; 200-word isolated speech recognition; speaker verification; CELP speech coding; ISDN-CD stereo receiver; telephone number recognition; 1,000-word continuous speech recognition; 5,000-word continuous speech recognition; CIF video; HDTV receiver]
• Also CAD, Databases, . . .
• 100 processors get you 10 years ahead, 1000 get you 20!
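The "years gained" rule of thumb can be turned into a small calculation. The sketch below assumes perfect speedup across processors and a constant annual growth in uniprocessor performance. Both are simplifying assumptions, and the function names are hypothetical helpers introduced for this example.

```python
import math


def years_gained(processors: int, annual_growth: float) -> float:
    """Years a uniprocessor needs to catch up with `processors` of today's
    machines, assuming perfect speedup and constant annual growth."""
    return math.log(processors) / math.log(1.0 + annual_growth)


def implied_growth(processors: int, years: float) -> float:
    """Annual uniprocessor growth rate implied by a 'P processors buy you
    N years' statement, under the same assumptions."""
    return processors ** (1.0 / years) - 1.0


# Growth rates implied by the two statements on the slide.
print(f"100 processors = 10 years implies {implied_growth(100, 10):.0%} per year")
print(f"1000 processors = 20 years implies {implied_growth(1000, 20):.0%} per year")

# Years gained by 100 processors at the ends of a 50% to 100% per year range.
for rate in (0.5, 1.0):
    print(f"at {rate:.0%} per year, 100 processors gain "
          f"{years_gained(100, rate):.1f} years")
```

Because real applications rarely achieve perfect speedup, the results are best read as upper bounds on the time gained.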
24
Learning Curve for Parallel Applications
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOP on Cray90, 406 for final version on 128-processor Paragon, 891 on 128-processor Cray T3D
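The AMBER figures are easier to compare as ratios. The short sketch below derives the improvement over the vector baseline and the rate delivered per processor. It uses only the three numbers quoted on the slide; the variable and function names are introduced for this example.

```python
# Rates quoted on the slide, in MFLOP.
CRAY90_BASELINE = 145.0
PARALLEL_RESULTS = {
    "Paragon": (406.0, 128),    # (rate, processor count)
    "Cray T3D": (891.0, 128),
}


def summarize(rate: float, processors: int, baseline: float):
    """Return (improvement over the baseline, rate per processor)."""
    return rate / baseline, rate / processors


for machine, (rate, processors) in PARALLEL_RESULTS.items():
    improvement, per_processor = summarize(rate, processors, CRAY90_BASELINE)
    print(f"{machine}: {improvement:.1f}x the Cray90 rate, "
          f"{per_processor:.1f} MFLOP per processor")
```

The per-processor rates are small next to the vector machine's 145, which shows that the gain comes from aggregating many modest processors.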
25
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK performance in MFLOPS (log scale, 1 to 10,000) versus year, 1975 to 2000, for CRAY and microprocessor systems at n = 100 and n = 1,000. CRAY systems: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94. Microprocessor systems: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha, DEC Alpha AXP, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200]
26
500 Fastest Computers
[Figure: number of systems (0 to 350) among the 500 fastest computers, by class (PVP, MPP, SMP), for 11/93, 11/94, 11/95, and 11/96. Data labels in the chart: 319, 106, 284, 239, 63, 187, 313, 198, 110, 106, 73]