CS 594 Spring 2002 Lecture 4
Jack Dongarra, University of Tennessee

Page 1:

1

CS 594 Spring 2002, Lecture 4

Jack Dongarra, University of Tennessee

Page 2:

2

Plan For Today

Dr. David Cronk on Homework #2
Finish Lecture: Parallel Architectures and Programming
Floating point arithmetic

Page 3:

Programming Model 3: Data Parallel

Single sequential thread of control consisting of parallel operations
Parallel operations applied to all (or a defined subset) of a data structure
Communication is implicit in parallel operators and “shifted” data structures
Elegant and easy to understand and reason about
Not all problems fit this model
Like marching in a regiment

Think of Matlab:
A  = array of all data
fA = f(A)
s  = sum(fA)
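To make the A / fA / s example concrete (this is an illustration, not from the slides), here is a plain C sketch; the OpenMP pragma stands in for the implicit parallelism a data-parallel language provides, and the choice of f (squaring) and the array size are arbitrary:

```c
#include <stdio.h>

#define N 1000000

/* f is applied elementwise; squaring is an arbitrary choice for illustration */
static double f(double x) { return x * x; }

int main(void) {
    static double A[N], fA[N];
    double s = 0.0;

    for (int i = 0; i < N; i++)              /* A  = array of all data */
        A[i] = (double)i / N;

    /* fA = f(A); s = sum(fA): one logical data-parallel operation per line */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++) {
        fA[i] = f(A[i]);
        s += fA[i];
    }

    printf("s = %g\n", s);
    return 0;
}
```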

Page 4:

Model 3: Vector Computing

One instruction executed across all the data in a pipelined fashion
Parallel operations applied to all (or a defined subset) of a data structure
Communication is implicit in parallel operators and “shifted” data structures
Elegant and easy to understand and reason about
Not all problems fit this model
Like marching in a regiment

Think of Matlab:
A  = array of all data
fA = f(A)
s  = sum(fA)

Page 5:

5

Machine Model 3

A SIMD (Single Instruction Multiple Data) machine
A large number of small processors
A single “control processor” issues each instruction
» each processor executes the same instruction
» some processors may be turned off on any instruction

[Diagram: control processor driving P1 ... Pn, each with its own memory and network interface (NI), connected by an interconnect.]

Machines not popular (CM2), but the programming model is
» implemented by mapping n-fold parallelism to p processors
» mostly done in the compilers (HPF = High Performance Fortran)

Page 6:

6

Machine Model 4

Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
CLUMP = Cluster of SMPs
Shared memory within one SMP, message passing outside
Clusters, ASCI Red (Intel), ...
Programming model?
» Treat the machine as “flat”, always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
» Expose two layers: shared memory (OpenMP) and message passing (MPI); higher performance, but ugly to program (a sketch of this style follows below)
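A minimal sketch (not from the slides) of the two-layer CLUMP style: MPI between SMP nodes, OpenMP threads within each node. The toy computation and all parameter choices are illustrative only; compile with an MPI compiler and OpenMP enabled (e.g. mpicc -fopenmp).

```c
/* Hybrid "CLUMP" sketch: MPI across nodes, OpenMP threads inside each SMP. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Shared-memory layer: threads cooperate on this node's partial sum. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += nprocs)
        local += 1.0 / (1.0 + i);

    /* Message-passing layer: combine the per-node results. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```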

Page 7:

Programming Model 5: Bulk Synchronous Processing (BSP) – L. Valiant

Used within the message passing or shared memory models as a programming convention
Phases separated by global barriers
Compute phases: all operate on local data (in distributed memory)
» or read access to global data (in shared memory)
Communication phases: all participate in rearrangement or reduction of global data
Generally all doing the “same thing” in a phase
» all do f, but may all do different things within f
Simplicity of data parallelism without restrictions
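A small sketch (not from the slides) of the BSP superstep pattern expressed with MPI: a compute phase on local data, then a communication phase that reduces global data, with a barrier marking the superstep boundary (MPI_Allreduce already synchronizes; the explicit barrier just makes the phase structure visible). The arithmetic is arbitrary.

```c
/* BSP-style supersteps on top of MPI: local compute, then global exchange. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double x = rank + 1.0;              /* local data */
    for (int step = 0; step < 3; step++) {
        /* Compute phase: all processes operate on local data ("all do f"). */
        x = x * 0.5 + 1.0;

        /* Communication phase: rearrangement/reduction of global data. */
        double total;
        MPI_Allreduce(&x, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        x = total / p;                  /* everyone continues from the global result */

        MPI_Barrier(MPI_COMM_WORLD);    /* superstep boundary (global barrier) */
    }
    if (rank == 0) printf("final value = %f\n", x);
    MPI_Finalize();
    return 0;
}
```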

Page 8:

8

Summary so far

Historically, each parallel machine was unique, along with its programming model and programming language
» You had to throw away your software and start over with each new kind of machine - ugh
Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines
» MPI is now the most portable option, but can be tedious
Writing portably fast code requires tuning for the architecture
» Algorithm design challenge is to make this process easy
» Example: picking a block size, not rewriting the whole algorithm

Page 9:

9

Recap

Parallel computer architecture is driven by familiar technological and economic forces
» application/platform cycle, but focused on the most demanding applications
» hardware/software learning curve
More attractive than ever because the ‘best’ building block - the microprocessor - is also the fastest BB.
History of microprocessor architecture is parallelism
» translates area and density into performance
The future is higher levels of parallelism
Parallel architecture concepts apply at many levels
Communication is also on an exponential curve
=> Quantitative Engineering approach

[Diagram: cycle of New Applications, More Performance, Speedup]

Page 10:

13

Performance Numbers on RISC Processors Using Linpack Benchmark

Machine            MHz    Linpack n=100    Ax=b n=1000    Peak
                          Mflop/s          Mflop/s        Mflop/s
Intel P4           2200   1033 (47%)       1911 (86%)     2200
Compaq Alpha       1000    824 (41%)       1542 (77%)     2000
Intel/HP Itanium    800    600 (19%)       2382 (74%)     3200
AMD Athlon         1200    558 (23%)        998 (42%)     2400
HP PA               550    468 (21%)       1583 (71%)     2200
IBM Power 3         375    424 (28%)       1208 (80%)     1500
Intel P3            933    234 (25%)        514 (55%)      933
PowerPC G4          533    231 (22%)        478 (45%)     1066
SUN Ultra 80        450    208 (23%)        607 (67%)      900
SGI Origin 2K       300    173 (29%)        553 (92%)      600
Cray T90            454    705 (39%)       1603 (89%)     1800
Cray C90            238    387 (41%)        902 (95%)      952
Cray Y-MP           166    161 (48%)        324 (97%)      333
Cray X-MP           118    121 (51%)        218 (93%)      235
Cray J-90           100    106 (53%)        190 (95%)      200
Cray 1               80     27 (17%)        110 (69%)      160

Page 11:

14

Consider Scientific Supercomputing

Proving ground and driver for innovative architecture and techniques
Market smaller relative to commercial as MPs become mainstream
Dominated by vector machines starting in the 70s
Microprocessors have made huge gains in floating-point performance
» high clock rates
» pipelined floating point units (e.g., multiply-add every cycle)
» instruction-level parallelism
» effective use of caches (e.g., automatic blocking)
Plus economics
Large-scale multiprocessors replace vector supercomputers

Page 12:

15

Architectures

[Chart: number of systems by architecture class (Single Processor, SMP, MPP, SIMD, Constellation, Cluster/NOW), y-axis 0-500. Representative systems: Y-MP C90, Sun HPC, Paragon, CM5, T3D, T3E, SP2, Cluster of Sun HPC, ASCI Red, CM2, VP500, SX3.]

Constellation: # of processors per node >= # of nodes

Page 13:

16

Chip Technology

[Chart: number of systems by processor family (Alpha, Power, HP, Intel, MIPS, Sparc, other COTS, proprietary), y-axis 0-500.]

Page 14:

17

Manufacturer

[Chart: number of systems by manufacturer (Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others), y-axis 0-500.]

IBM 32%, HP 30%, SGI 8%, Cray 8%, SUN 6%, Fujitsu 4%, NEC 3%, Hitachi 3%

Page 15:

18

High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
» COTS PC nodes: Pentium, Alpha, PowerPC, SMP
» COTS LAN/SAN interconnect: Ethernet, Myrinet, Giganet, ATM
» Open source Unix: Linux, BSD
» Message passing computing: MPI, PVM, HPF

Advantages:
» Best price-performance
» Low entry-level cost
» Just-in-place configuration
» Vendor invulnerable
» Scalable
» Rapid technology tracking

Enabled by PC hardware, networks and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message passing libraries. However, much more of a contact sport.

Page 16:

19

Peak performance
Interconnection
http://clusters.top500.org
Benchmark results to follow in the coming months

Page 17:

20

Distributed and Parallel Systems

[Diagram: spectrum from distributed, heterogeneous systems to massively parallel, homogeneous systems. Example systems placed along it: SETI@home, Entropia, Grid Computing, Beowulf, Berkeley NOW, SNL Cplant, ASCI Tflops, parallel distributed-memory machines.]

Distributed systems (heterogeneous):
» Gather (unused) resources
» Steal cycles
» System SW manages resources
» System SW adds value
» 10% - 20% overhead is OK
» Resources drive applications
» Time to completion is not critical
» Time-shared

Massively parallel systems (homogeneous):
» Bounded set of resources
» Apps grow to consume all cycles
» Application manages resources
» System SW gets in the way
» 5% overhead is maximum
» Apps drive purchase of equipment
» Real-time constraints
» Space-shared

Page 18:

21

Different Parallel Architectures

Parallel computing: single systems with many processors working on the same problem
Distributed computing: many systems loosely coupled by a scheduler to work on related problems
Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems

Page 19:

22

Historical Development

[Diagram: two organizations; a “mainframe” style with processors and I/O connected to memory banks through a crossbar, and a “minicomputer” style with processors (each with a cache, $) and I/O sharing a single bus to memory.]

“Mainframe” approach
» Motivated by multiprogramming
» Extends crossbar used for Mem and I/O
» Processor cost-limited => crossbar
» Bandwidth scales with p
» High incremental cost: use multistage instead

“Minicomputer” approach
» Almost all microprocessor systems have a bus
» Motivated by multiprogramming, TP
» Used heavily for parallel computing
» Called symmetric multiprocessor (SMP)
» Latency larger than for uniprocessor
» Bus is bandwidth bottleneck: caching is key, which creates a coherence problem
» Low incremental cost

Page 20:

23

Shared Virtual Address Space

Process = address space plus thread of control
Virtual-to-physical mapping can be established so that processes share portions of the address space
» user-kernel or multiple processes
Multiple threads of control on one address space
» popular approach to structuring OSs
» now standard application capability (ex: POSIX threads; a small sketch follows below)
Writes to shared addresses visible to other threads
» natural extension of the uniprocessor model
» conventional memory operations (loads/stores) for communication
» special atomic operations for synchronization
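A minimal POSIX threads sketch of this model (illustrative, not from the slides): several threads share one address space, communicate through an ordinary shared variable, and synchronize with a mutex.

```c
/* Minimal POSIX threads sketch: one address space, ordinary stores for
   communication, a mutex as the special synchronization operation.      */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                    /* shared: visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);          /* synchronization */
        counter++;                          /* communication via shared memory */
        pthread_mutex_unlock(&lock);
    }
    printf("thread %ld done\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * 100000);
    return 0;
}
```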

Page 21:

24

Engineering: Intel Pentium Pro Quad

All coherence and multiprocessing glue in the processor module
Highly integrated, targeted at high volume
Low latency and bandwidth

[Diagram: four P-Pro modules (CPU, 256-KB L2 $, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges to PCI buses and I/O cards.]

Page 22:

25

Engineering: SUN Enterprise

Proc + mem card, I/O card; 16 cards of either type
All memory accessed over the bus, so symmetric
Higher bandwidth, higher latency bus

[Diagram: CPU/mem cards (two processors, each with $ and $2, plus a memory controller) and I/O cards (SBUS slots, 2 FiberChannel, 100bT, SCSI) attached via bus interfaces/switches to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]

Page 23:

26

Scaling Up

Problem is the interconnect: cost (crossbar) or bandwidth (bus)
“Dance hall”: bandwidth still scalable, but lower cost than crossbar
» latencies to memory uniform, but uniformly large
Distributed memory or non-uniform memory access (NUMA)
» construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
Caching shared (particularly nonlocal) data?

[Diagram: “dance hall” organization (processors with caches on one side of a network, memories on the other) versus distributed memory (each processor/cache pair has its own memory, all connected by the network).]

Page 24:

27

Engineering: Cray T3E

Scales up to 1024 processors, 480 MB/s links
Memory controller generates request messages for non-local references
No hardware mechanism for coherence
» SGI Origin etc. provide this

[Diagram: each node has a processor with cache ($), memory, and a combined memory controller and NI attached to a switch with X, Y, Z links and external I/O.]

Page 25:

28

Diminishing Role of Topology

Shift to general links
» DMA, enabling non-blocking ops
» buffered by system at destination until recv
Store&forward routing: H x (T0 + n/B)
Any-to-any pipelined routing: T0 + H + n/B
» node-network interface dominates communication time
Diminishing role of topology
» simplifies programming
» allows richer design space (grids vs hypercubes)
Intel iPSC/1 -> iPSC/2 -> iPSC/860
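To see why topology (the hop count H) matters much less with pipelined routing, the two cost expressions above can be compared numerically. The sketch below is illustrative only: all parameter values are made up, and the per-hop delay is written explicitly as thop, so with thop = 1 time unit the second expression reduces to the slide's T0 + H + n/B.

```c
/* Compare the two routing-cost expressions from the slide.  All numbers
   are made-up illustration values.                                      */
#include <stdio.h>

int main(void) {
    double T0 = 50.0;    /* per-message startup overhead (microseconds) */
    double B  = 100.0;   /* link bandwidth (bytes per microsecond)      */
    double thop = 1.0;   /* delay per hop/switch (microseconds)         */
    double n = 4096.0;   /* message size (bytes)                        */

    for (int H = 1; H <= 16; H *= 2) {                /* hops */
        double store_forward = H * (T0 + n / B);      /* H x (T0 + n/B) */
        double pipelined     = T0 + H * thop + n / B; /* T0 + H + n/B   */
        printf("H=%2d  store&forward=%7.1f us  pipelined=%6.1f us\n",
               H, store_forward, pipelined);
    }
    return 0;
}
```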

Page 26:

29

Example: Intel Paragon

[Diagram: Intel Paragon node with two i860 processors (each with L1 $), a memory controller to 4-way interleaved DRAM, driver, DMA, and NI on a 64-bit, 50 MHz memory bus; nodes attached to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links.]

Sandia’s Intel Paragon XP/S-based supercomputer

Page 27:

30

Building on the mainstream: IBM SP-2

Made out of essentially complete RS6000 workstations
Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)

[Diagram: IBM SP-2 node with Power 2 CPU, L2 $, and memory controller to 4-way interleaved DRAM on the memory bus; the NIC (with i860, NI, and DMA) sits on the MicroChannel I/O bus; nodes connect to a general interconnection network formed from 8-port switches.]

Page 28:

31

A Little History

Von Neumann and Goldstine - 1947
» “Can’t expect to solve most big [n>15] linear systems without carrying many decimal digits [d>8], otherwise the computed answer would be completely inaccurate.” - WRONG!
Turing - 1949
» “Carrying d digits is equivalent to changing the input data in the d-th place and then solving Ax=b. So if A is only known to d digits, the answer is as accurate as the data deserves.”
» Backward Error Analysis
Rediscovered in 1961 by Wilkinson and publicized
Starting in the 1960s, many papers doing backward error analysis of various algorithms
Many years where each machine did FP arithmetic slightly differently
» both rounding and exception handling differed
» hard to write portable and reliable software
» motivated search for an industry-wide standard, beginning late 1970s
» first implementation: Intel 8087
ACM Turing Award 1989 to W. Kahan for design of the IEEE Floating Point Standards 754 (binary) and 854 (decimal)
Nearly universally implemented in general purpose machines

Page 29:

32

Defining Floating Point Arithmetic

Representable numbers
» scientific notation: +/- d.d…d x r^exp
» sign bit +/-
» radix r (usually 2 or 10, sometimes 16)
» significand d.d…d (how many base-r digits d?)
» exponent exp (range?)
» others?
Operations:
» arithmetic: +, -, x, /, ...: how to round the result to fit in the format
» comparison (<, =, >)
» conversion between different formats: short to long FP numbers, FP to integer
» exception handling: what to do for 0/0, 2*largest_number, etc.
» binary/decimal conversion: for I/O, when the radix is not 10
Language/library support for these operations

Page 30:

33

IEEE Floating Point Arithmetic Standard 754 - Normalized Numbers

Normalized nonzero representable numbers: +- 1.d…d x 2^exp
Macheps = machine epsilon = 2^-(# significand bits) = relative error in each operation
OV = overflow threshold = largest number
UN = underflow threshold = smallest normalized number
+- Zero: +-, significand and exponent all zero (why bother with -0? later)

Format    # bits   # significand bits   macheps            # exponent bits   exponent range
Single    32       23+1                 2^-24 (~10^-7)      8                2^-126 to 2^127     (~10^+-38)
Double    64       52+1                 2^-53 (~10^-16)    11                2^-1022 to 2^1023   (~10^+-308)
Extended  >=80     >=64                 <=2^-64 (~10^-19)  >=15              2^-16382 to 2^16383 (~10^+-4932)

(Extended is 80 bits on all Intel machines.)
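The single and double rows of the table can be checked against the constants in C's <float.h>. A small sketch (note that C's DBL_EPSILON is the spacing between 1.0 and the next double, 2^-52, so the table's macheps 2^-53 is DBL_EPSILON/2, and likewise for single precision):

```c
/* Print macheps, OV, and UN for IEEE double and single precision. */
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("double: macheps = %e  OV = %e  UN = %e\n",
           DBL_EPSILON / 2, DBL_MAX, DBL_MIN);
    printf("single: macheps = %e  OV = %e  UN = %e\n",
           FLT_EPSILON / 2, (double)FLT_MAX, (double)FLT_MIN);
    return 0;
}
```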

Page 31:

34

IEEE Floating Point Arithmetic Standard 754 - “Denorms”

Denormalized numbers: +- 0.d…d x 2^min_exp
» sign bit, nonzero significand, minimum exponent
» fills in the gap between UN and 0
Underflow exception
» occurs when the exact nonzero result is less than the underflow threshold UN
» Ex: UN/3 returns a denorm, or zero
Why bother? Necessary so that the following code never divides by zero:
  if (a != b) then x = a/(a-b)
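A small C illustration (not from the slides) of both points: UN/3 produces a denormalized number rather than flushing to zero, and the a != b guard therefore really does protect the division. DBL_MIN plays the role of UN; the particular values of a and b are arbitrary.

```c
/* Underflow and denormals: DBL_MIN/3 is below the underflow threshold UN
   but is still representable (gradually) as a denormalized number.      */
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    double un = DBL_MIN;            /* underflow threshold (smallest normalized) */
    double d  = un / 3.0;           /* exact result < UN: returns a denorm */
    printf("UN   = %g\n", un);
    printf("UN/3 = %g (denormal: %s)\n", d,
           (d != 0.0 && fpclassify(d) == FP_SUBNORMAL) ? "yes" : "no");

    /* The guard from the slide: with denormals, a != b guarantees a-b != 0. */
    double a = un * 1.25, b = un;
    if (a != b) printf("a/(a-b) = %g (no division by zero)\n", a / (a - b));
    return 0;
}
```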

Page 32:

35

IEEE Floating Point Arithmetic Standard 754 - +- Infinity

+- Infinity: sign bit, zero significand, maximum exponent
Overflow exception
» occurs when the exact finite result is too large to represent accurately
» Ex: 2*OV returns +- infinity
Divide-by-zero exception
» returns +- infinity = 1/+-0
» sign of zero important!
Also return +- infinity for 3+infinity, 2*infinity, infinity*infinity
» result is exact, not an exception!
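A short C illustration (not from the slides) of these rules; IEEE semantics make each line well defined rather than a fatal error:

```c
/* Infinities: overflow, division by zero, and the sign of zero. */
#include <stdio.h>
#include <float.h>

int main(void) {
    double ov = DBL_MAX;
    printf("2*OV  = %g\n", 2.0 * ov);          /* overflow -> +inf            */
    printf("1/+0  = %g\n", 1.0 / 0.0);         /* divide by zero -> +inf      */
    printf("1/-0  = %g\n", 1.0 / -0.0);        /* sign of zero matters -> -inf */
    printf("3+inf = %g\n", 3.0 + 1.0 / 0.0);   /* exact result, not an exception */
    return 0;
}
```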

Page 33:

36

IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)

NAN: sign bit, nonzero significand, maximum exponent
Invalid exception
» occurs when the exact result is not a well-defined real number
» 0/0, sqrt(-1), infinity-infinity, infinity/infinity, 0*infinity, NAN + 3, NAN > 3?
» return a NAN in all these cases
Two kinds of NANs
» quiet - propagates without raising an exception
» signaling - generates an exception when touched (good for detecting uninitialized data)
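And the corresponding NaN cases, again as a small illustration (sqrt comes from <math.h>; link with -lm if needed):

```c
/* NaNs: produced by invalid operations, and unordered in comparisons. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double zero = 0.0, inf = 1.0 / zero;
    double nan1 = zero / zero;              /* 0/0      -> NaN */
    double nan2 = inf - inf;                /* inf-inf  -> NaN */
    double nan3 = sqrt(-1.0);               /* sqrt(-1) -> NaN */

    printf("0/0 = %g, inf-inf = %g, sqrt(-1) = %g\n", nan1, nan2, nan3);
    printf("NaN + 3 = %g\n", nan1 + 3.0);   /* NaNs propagate */
    printf("NaN > 3 ? %d   NaN == NaN ? %d\n", nan1 > 3.0, nan1 == nan1); /* both 0 */
    return 0;
}
```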

Page 34:

37

Error Analysis

Basic error formula
» fl(a op b) = (a op b)*(1 + d), where
»   op is one of +, -, *, /
»   |d| <= macheps
»   assuming no overflow, underflow, or divide by zero
Example: adding 4 numbers
  fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
                  = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
                    + x3*(1+d2)*(1+d3) + x4*(1+d3)
                  = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
  where each |ei| <~ 3*macheps
» get the exact sum of slightly changed summands xi*(1+ei)
Backward Error Analysis - an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
Numerical stability is an algorithm design goal
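One visible consequence of the fl(a op b) = (a op b)*(1+d) model is that the computed sum of the same four numbers depends on the grouping, with differences on the order of a few macheps. A small illustration (the particular values are chosen only to make the effect easy to see):

```c
/* Each flop satisfies fl(a op b) = (a op b)*(1+d), |d| <= macheps, so the
   computed sum depends on summation order; differences are a few macheps. */
#include <stdio.h>
#include <float.h>

int main(void) {
    double x1 = 1.0, x2 = 1e-16, x3 = 1e-16, x4 = 1e-16;

    double left  = ((x1 + x2) + x3) + x4;   /* the grouping analyzed on the slide */
    double right = x1 + (x2 + (x3 + x4));   /* same summands, different rounding  */

    printf("left  = %.17g\n", left);
    printf("right = %.17g\n", right);
    printf("difference = %g (macheps = %g)\n", left - right, DBL_EPSILON / 2);
    return 0;
}
```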

Page 35:

38

Backward error

Approximate solution is the exact solution to a modified problem.
How large a modification to the original problem is required to give the result actually obtained?
How much data error in the initial input would be required to explain all the error in the computed results?
Approximate solution is good if it is the exact solution to a “nearby” problem.

[Diagram: map f taking the true input x to f(x) and to the computed result; the forward error is the gap between f(x) and the computed result, and the backward error is the gap between x and the perturbed input x’ whose exact image f(x’) equals the computed result.]

Page 36:

39

Sensitivity and Conditioning

A problem is insensitive, or well-conditioned, if a relative change in the input causes a commensurate relative change in the solution.
A problem is sensitive, or ill-conditioned, if the relative change in the solution can be much larger than that in the input data.

Cond = |relative change in solution| / |relative change in input data|
     = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x|

A problem is sensitive, or ill-conditioned, if cond >> 1.

When the function f is evaluated for an approximate input x’ = x+h instead of the true input value x:
Absolute error = f(x + h) – f(x) ≈ h f’(x)
Relative error = [f(x + h) – f(x)] / f(x) ≈ h f’(x) / f(x)

Page 37:

40

Sensitivity: 2 Examples - cos(π/2) and 2-d System of Equations

Consider the problem of computing the cosine function for arguments near π/2.
Let x ≈ π/2 and let h be a small perturbation to x. Then

absolute error = cos(x+h) – cos(x) ≈ -h sin(x) ≈ -h
relative error ≈ -h tan(x) ≈ ∞

So a small change in x near π/2 causes a large relative change in cos(x) regardless of the method used.

cos(1.57079) = 0.63267949 x 10^-5
cos(1.57078) = 1.63267949 x 10^-5

The relative change in the output is a quarter million times greater than the relative change in the input.
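The two cosine values above can be reproduced directly; the sketch below (illustrative only) also prints the ratio of the relative changes, which comes out near 2.5 x 10^5, the "quarter million" on the slide:

```c
/* Near pi/2 a tiny relative change in x produces a huge relative change
   in cos(x): the problem is ill-conditioned there.                      */
#include <stdio.h>
#include <math.h>

int main(void) {
    double x1 = 1.57079, x2 = 1.57078;
    double y1 = cos(x1), y2 = cos(x2);

    double rel_in  = fabs((x2 - x1) / x1);
    double rel_out = fabs((y2 - y1) / y1);

    printf("cos(%.5f) = %.8e\n", x1, y1);
    printf("cos(%.5f) = %.8e\n", x2, y2);
    printf("relative change in input  = %.3e\n", rel_in);
    printf("relative change in output = %.3e\n", rel_out);
    printf("amplification (condition) = %.3e\n", rel_out / rel_in);
    return 0;
}
```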


Page 39:

42

Example: Polynomial Evaluation Using Horner’s Rule

Horner’s rule to evaluate p = sum_{k=0}^{n} c_k * x^k:
  p = c_n
  for k = n-1 down to 0: p = x*p + c_k
Numerically stable
Apply to (x-2)^9 = x^9 - 18*x^8 + … - 512
  (Horner form: -512 + x*(2304 - x*(4608 - … )))
Evaluated around x = 2 (see the sketch below)
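A direct C implementation of the loop above (illustrative; the coefficients are those of the fully expanded (x-2)^9, and the range of x values is an arbitrary choice). Near x = 2 the computed values bounce around in rounding noise instead of following the tiny true value of (x-2)^9:

```c
/* Horner's rule applied to the expanded coefficients of (x-2)^9,
   evaluated near x = 2, where cancellation leaves only rounding noise. */
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Coefficients of (x-2)^9 = x^9 - 18 x^8 + ... - 512, highest degree first. */
    const double c[10] = { 1, -18, 144, -672, 2016, -4032, 5376, -4608, 2304, -512 };

    for (double x = 1.95; x <= 2.05001; x += 0.01) {
        double p = c[0];                      /* p = c_n                 */
        for (int k = 1; k <= 9; k++)          /* for k = n-1 down to 0:  */
            p = x * p + c[k];                 /*   p = x*p + c_k         */
        printf("x = %.2f   horner = % .3e   exact (x-2)^9 = % .3e\n",
               x, p, pow(x - 2.0, 9.0));
    }
    return 0;
}
```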

Page 40:

43

Example: Polynomial Evaluation (continued)

(x-2)^9 = x^9 - 18*x^8 + … - 512
We can compute error bounds using fl(a op b) = (a op b)*(1+d)

Page 41:

44

What happens when the “exact value” is not a real number, or is too small or too large to represent accurately?

You get an “exception”

Page 42:

45

Exception Handling

What happens when the “exact value” is not a real number, or is too small or too large to represent accurately?
5 exceptions:
» Overflow - exact result > OV, too large to represent
» Underflow - exact result nonzero and < UN, too small to represent
» Divide-by-zero - nonzero/0
» Invalid - 0/0, sqrt(-1), …
» Inexact - you made a rounding error (very common!)
Possible responses
» stop with an error message (unfriendly, not the default)
» keep computing (the default, but how? a sketch using C’s exception flags follows below)
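C99's <fenv.h> exposes the five flags, so the "keep computing" default can be observed directly. A small sketch (illustrative; the volatile qualifiers just keep the compiler from folding the operations away, and some compilers ignore the FENV_ACCESS pragma with a warning):

```c
/* Inspect the five IEEE exception flags after provoking each condition.
   Default response: set a flag and keep computing.                      */
#include <stdio.h>
#include <fenv.h>
#include <float.h>

#pragma STDC FENV_ACCESS ON

static void report(const char *what) {
    printf("%-6s overflow=%d underflow=%d divbyzero=%d invalid=%d inexact=%d\n",
           what,
           !!fetestexcept(FE_OVERFLOW),  !!fetestexcept(FE_UNDERFLOW),
           !!fetestexcept(FE_DIVBYZERO), !!fetestexcept(FE_INVALID),
           !!fetestexcept(FE_INEXACT));
    feclearexcept(FE_ALL_EXCEPT);
}

int main(void) {
    volatile double big = DBL_MAX, tiny = DBL_MIN, zero = 0.0;
    volatile double r;

    r = big * 2.0;      report("2*OV:");
    r = tiny / 3.0;     report("UN/3:");
    r = 1.0 / zero;     report("1/0:");
    r = zero / zero;    report("0/0:");
    r = 2.0 / 3.0;      report("2/3:");
    (void)r;
    return 0;
}
```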

Page 43:

46

Summary of Values Representable in IEEE FP

+- Zero
Normalized nonzero numbers
Denormalized numbers
+- Infinity
NANs
» signaling and quiet
» many systems have only quiet

Value                        Exponent field      Significand field
+- Zero                      0…0                 0……………………0
+- Denormalized numbers      0…0                 nonzero
+- Normalized nonzero        not 0s or all 1s    anything
+- Infinity                  1….1                0……………………0
+- NANs                      1….1                nonzero
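The bit patterns in the table can be inspected directly by reinterpreting a double's 64 bits (1 sign bit, 11 exponent bits, 52 significand bits); this sketch assumes the usual IEEE binary64 layout:

```c
/* Decode sign, exponent field, and significand field of IEEE doubles. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

static void show(double x, const char *label) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                       /* reinterpret the 64 bits */
    unsigned sign = (unsigned)(bits >> 63);
    unsigned exp  = (unsigned)((bits >> 52) & 0x7FF);     /* 11-bit exponent field   */
    unsigned long long frac = bits & 0xFFFFFFFFFFFFFULL;  /* 52-bit significand      */
    printf("%-8s sign=%u exponent=%03x significand=%013llx\n", label, sign, exp, frac);
}

int main(void) {
    show(0.0, "+0");        /* exponent 0...0, significand 0...0     */
    show(-0.0, "-0");
    show(1.0, "1.0");       /* normalized: exponent not all 0s or 1s */
    show(5e-324, "denorm"); /* exponent 0...0, nonzero significand   */
    show(INFINITY, "+inf"); /* exponent 1...1, significand 0...0     */
    show(NAN, "NaN");       /* exponent 1...1, nonzero significand   */
    return 0;
}
```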