Concurrent Computers
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu


TRANSCRIPT

Page 1: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

1

Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/

sadjadi At cs Dot fiu Dot edu

Concurrent Computers

Page 2: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

2

Acknowledgements
The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!

Henri Casanova Principles of High Performance Computing http://navet.ics.hawaii.edu/~casanova [email protected]

Kai Wang Department of Computer Science University of South Dakota http://www.usd.edu/~Kai.Wang

Andrew Tanenbaum

Page 3: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

3

Concurrency and Computers

We will see computer systems designed to allow concurrency (for performance benefits)

Concurrency occurs at many levels in computer systems:
Within a CPU: for example, on-chip parallelism
Within a "box": for example, a coprocessor or a multiprocessor
Across boxes: for example, multicomputers, clusters, and grids

Page 4: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

4

Parallel Computer Architectures

(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor.

(d) A multicomputer. (e) A grid.

Page 5: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

5

Concurrency and Computers

We will see computer systems designed to allow concurrency (for performance benefits)

Concurrency occurs at many levels in computer systems: within a CPU, within a "box", and across boxes.

Page 6: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

6

Concurrency within a CPU

[Figure: block diagram of a computer. The CPU (registers, ALUs, and hardware to decode instructions and do all types of useful things) connects through caches and busses to RAM, controllers, network adapters, and I/O devices such as displays and keyboards.]

Page 7: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

7

Concurrency within a CPU

Several techniques allow concurrency within a single CPU:
Pipelining (RISC architectures, pipelined functional units)
ILP
Vector units
On-Chip Multithreading
Let's look at them briefly.

Page 8: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

8

Concurrency within a CPU

Several techniques allow concurrency within a single CPU:
Pipelining (RISC architectures, pipelined functional units)
ILP
Vector units
On-Chip Multithreading
Let's look at them briefly.

Page 9: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

9

Pipelining
If one has a sequence of tasks to do, if each task consists of the same n steps or stages, and if different steps can be done simultaneously, then one can have a pipelined execution of the tasks.
e.g., an assembly line. Goal: higher throughput (i.e., number of tasks per time unit).

Time to do 1 task = 9
Time to do 2 tasks = 13
Time to do 3 tasks = 17
Time to do 4 tasks = 21
Time to do 10 tasks = 45
Time to do 100 tasks = 409

Pays off if there are many tasks.
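A quick sketch of the timing rule behind numbers like these: with stages of durations s1..sk, the first task finishes after s1 + ... + sk, and every additional task finishes one slowest-stage duration later. The stage durations below are assumptions chosen for illustration, not the slide's actual stages.

#include <stdio.h>

int main(void)
{
    /* Hypothetical stage durations (time units); not taken from the slide. */
    int stage[] = { 2, 3, 4 };
    int nstages = 3, total = 0, slowest = 0;

    for (int i = 0; i < nstages; i++) {
        total += stage[i];
        if (stage[i] > slowest) slowest = stage[i];
    }

    /* The first task takes the full pipeline length; each additional task
       completes one "slowest stage" later. Sequential time is n * total. */
    for (int n = 1; n <= 100; n *= 10) {
        int pipelined  = total + (n - 1) * slowest;
        int sequential = n * total;
        printf("%3d task(s): pipelined = %4d, sequential = %4d\n",
               n, pipelined, sequential);
    }
    return 0;
}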

Page 10: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

10

Pipelining

The pipeline can advance only as fast as its slowest stage

Therefore, the asymptotic throughput (i.e., the throughput when the number of tasks tends to infinity) is equal to:

1 / (duration of the slowest stage)

Therefore, in an ideal pipeline, all stages would be identical (balanced pipeline)

Question: Can we make computer instructions all consist of the same number of stages, where all stages take the same number of clock cycles?


Page 11: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

11

RISC
Having all instructions execute in the same number of stages, each of the same duration, is the RISC idea.
Example: the MIPS architecture (see THE architecture book by Patterson and Hennessy).
5 stages:
Instruction Fetch (IF)
Instruction Decode (ID)
Instruction Execute (EX)
Memory Access (MEM)
Register Write Back (WB)

Each stage takes one clock cycle

[Figure: two instructions, LD R2, 12(R3) and DADD R3, R5, R6, each flowing through the IF, ID, EX, MEM, and WB stages with their executions overlapped: concurrent execution of two instructions.]

Page 12: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

12

Pipelined Functional Units
Although the RISC idea is attractive, some operations are just too expensive to be done in one clock cycle (during the EX stage).
Common example: floating-point operations.
Solution: implement them as a sequence of stages, so that they can be pipelined.

[Figure: a pipeline with IF, ID, MEM, and WB stages and three execution paths: a single-cycle EX integer unit, a seven-stage (M1-M7) FP/integer multiply unit, and a four-stage (A1-A4) FP/integer add unit.]

Page 13: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

13

Pipelining Today
Pipelined functional units are common.
Fallacy: all computers today are RISC.
RISC was of course one of the most fundamental "new" ideas in computer architecture.
x86: the most commonly used Instruction Set Architecture today.
Kept around for backwards-compatibility reasons, because it is easy to implement (not to program for).
BUT: modern x86 processors decode instructions into "micro-ops", which are then executed in a RISC manner.

Bottom line: pipelining is a pervasive (and conveniently hidden) form of concurrency in computers today

Take a computer architecture course to know all about it

Page 14: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

14

Concurrency within a CPU

Several techniques allow concurrency within a single CPU: pipelining, ILP, vector units, and on-chip multithreading.

Page 15: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

15

Instruction Level Parallelism
Instruction Level Parallelism (ILP) is the set of techniques by which the performance of a pipelined processor can be pushed even further.
ILP can be done by the hardware:
Dynamic instruction scheduling
Dynamic branch prediction
Multi-issue superscalar processors
ILP can be done by the compiler:
Static instruction scheduling
Multi-issue VLIW (Very Long Instruction Word) processors with multiple functional units
Broad concept: more than one instruction is issued per clock cycle, e.g., an 8-way multi-issue processor.

Page 16: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

16

Concurrency within a CPU

Several techniques allow concurrency within a single CPU: pipelining, ILP, vector units, and on-chip multithreading.

Page 17: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

17

Vector Units
A functional unit that can do element-wise operations on entire vectors with a single instruction, called a vector instruction.
These are specified as operations on vector registers.
A "vector processor" comes with some number of such registers.
Example: the MMX extension on x86 architectures.

[Figure: a vector add. Two vector registers of #elts elements each are combined with #elts additions performed in parallel, producing a result register of #elts elements.]

Page 18: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

18

Vector Units
Typically, a vector register holds ~32-64 elements.
But the number of elements is always larger than the amount of parallel hardware, called vector pipes or lanes, typically 2-4.

[Figure: the same vector add on hardware with #pipes lanes: only #elts / #pipes additions proceed in parallel at a time.]

Page 19: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

19

MMX Extension
Many techniques that are initially implemented in the "supercomputer" market find their way to the mainstream.
Vector units were pioneered in supercomputers.
Supercomputers are mostly used for scientific computing.
Scientific computing uses tons of arrays (to represent mathematical vectors) and often does regular computation with these arrays.
Therefore, scientific code is easy to "vectorize", i.e., to generate assembly that uses the vector registers and the vector instructions.
Examples: Intel's MMX or PowerPC's AltiVec.
MMX vector registers hold eight 8-bit elements, four 16-bit elements, or two 32-bit elements.
AltiVec: twice the lengths.
Used for "multi-media" applications: image processing, rendering, ...

Page 20: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

20

Vectorization Example
Conversion from RGB to YUV:
Y = (9798*R + 19235*G + 3736*B) / 32768;
U = (-4784*R - 9437*G + 4221*B) / 32768 + 128;
V = (20218*R - 16941*G - 3277*B) / 32768 + 128;
This kind of code is perfectly parallel, as all pixels can be computed independently.
Can be done easily with MMX vector capabilities:
Load 8 R values into an MMX vector register
Load 8 G values into an MMX vector register
Load 8 B values into an MMX vector register
Do the *, +, and / in parallel
Repeat
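For reference, here is the same conversion as a plain scalar C loop (a sketch; the 16-bit element type is an assumption for illustration). Each iteration is independent, which is exactly what lets a compiler, or hand-written MMX/SSE intrinsics, process 8 pixels at a time as described above.

#include <stdint.h>

/* Scalar RGB -> YUV conversion; every pixel is independent, so the loop
   can be vectorized 8 pixels at a time with MMX-style registers. */
void rgb_to_yuv(const int16_t *R, const int16_t *G, const int16_t *B,
                int16_t *Y, int16_t *U, int16_t *V, int n)
{
    for (int i = 0; i < n; i++) {
        Y[i] = ( 9798 * R[i] + 19235 * G[i] + 3736 * B[i]) / 32768;
        U[i] = (-4784 * R[i] -  9437 * G[i] + 4221 * B[i]) / 32768 + 128;
        V[i] = (20218 * R[i] - 16941 * G[i] - 3277 * B[i]) / 32768 + 128;
    }
}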

Page 21: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

21

Concurrency within a CPU

Several techniques allow concurrency within a single CPU: pipelining, ILP, vector units, and on-chip multithreading.

Page 22: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

22

Multi-threaded Architectures
Computer architecture is a difficult field to make innovations in.
Who's going to spend money to manufacture your new idea?
Who's going to be convinced that a new compiler can/should be written?
Who's going to be convinced of a new approach to computing?
One of the "cool" innovations in the last decade has been the concept of a "multi-threaded architecture".

Page 23: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

23

On-Chip Multithreading

Multithreading has been around for years, so what’s new about this?

Here we're talking about hardware support for threads:
Simultaneous Multi-Threading (SMT)
Super-threading
Hyper-threading

Let’s try to understand what all of these mean before looking at multi-threaded Supercomputers

Page 24: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

24

Single-threaded Processor
The CPU front-end does the fetching/decoding/reordering; the execution core does the actual execution.
Multiple programs reside in memory, but only one executes at a time: time-slicing via context switching.
[Figure: a 4-issue CPU and a 7-unit pipelined CPU, both showing pipeline bubbles (unused issue slots).]

Page 25: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

25

Single-threaded SMP?

Two threads execute at once, so threads spend less time waiting

The number of "bubbles" is also doubled: twice as much speed and twice as much waste.

Page 26: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

26

Super-threading
Principle: the processor can execute more than one thread at a time.
Also called time-slice multithreading; the processor is then called a multithreaded processor.
Requires more hardware cleverness: the logic switches threads at each cycle.
Leads to less waste: a thread can run during a cycle while another thread is waiting for memory.
Just a finer grain of interleaving.
But there is a restriction: each stage of the front end or the execution core only runs instructions from ONE thread!
Does not help with poor instruction parallelism within one thread.
Does not reduce bubbles within a row.

Page 27: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

27

Hyper-threading
Principle: the processor can execute more than one thread at a time, even within a single clock cycle!!
Requires even more hardware cleverness: the logic switches threads within each cycle.
On the diagram, only two threads execute simultaneously; Intel's hyper-threading only adds 5% to the die area.
Some people argue that "two" is not "hyper".
Finest level of interleaving.
From the OS perspective, there are two "logical" processors.
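This is easy to observe from software: a minimal sketch, assuming a Linux/POSIX-like system where the _SC_NPROCESSORS_ONLN extension is available, that simply asks the OS how many logical processors it sees (on a hyper-threaded chip this is typically twice the number of physical cores).

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of logical processors currently online; hyper-threading
       makes the OS count each hardware thread as one processor. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("OS sees %ld logical processor(s)\n", n);
    return 0;
}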

Page 28: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

28

Concurrency and Computers

We will see computer systems designed to allow concurrency (for performance benefits)

Concurrency occurs at many levels in computer systems: within a CPU, within a "box", and across boxes.

Page 29: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

29

Concurrency within a “Box”

Two main techniques:
SMP
Multi-core

Let’s look at both of them

Page 30: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

30

SMPs
Symmetric Multi-Processors (often mislabeled as "Shared-Memory Processors", which has now become tolerated).
Processors are all connected to a single memory.
Symmetric: each memory cell is equally close to all processors.
Many dual-proc and quad-proc systems, e.g., for servers.

[Figure: processors P1, P2, ..., Pn, each with a cache ($), connected by a network/bus to a single shared memory.]

Page 31: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

31

Distributed Caches
The problem with distributed caches is that of memory consistency.
Intuitive memory model: reading an address should return the last value written to that address.
Easy to do in uniprocessors, although there may be some I/O issues.
But difficult in multi-processor / multi-core systems.
Memory consistency: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]
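To make the definition concrete, here is a minimal two-thread litmus test (a sketch added for illustration, not from the slides). Under sequential consistency at least one thread must observe the other's write, so the outcome r1 == 0 and r2 == 0 is forbidden; on hardware with relaxed memory ordering (and with relaxed atomics, as used here) that outcome can actually occur.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;        /* shared flags, both start at 0 */
int r1, r2;             /* what each thread observed */

void *thread1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *thread2(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread1, NULL);
    pthread_create(&b, NULL, thread2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Sequential consistency forbids r1 == 0 && r2 == 0. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}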

Page 32: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

32

Cache Coherency
Memory consistency is jeopardized by having multiple caches.
Example: P1 and P2 both have a cached copy of a data item.
P1 writes to it, possibly with a write-through to memory.
At this point P2 owns a stale copy.
When designing a multi-processor system, one must ensure that this cannot happen, by defining protocols for cache coherence.

Page 33: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

33

Snoopy Cache-Coherence

The memory bus is a broadcast medium.
Caches contain information on which addresses they store.
The cache controller "snoops" all transactions on the bus.
A transaction is relevant if it involves a cache block currently contained in this cache.
The controller then takes action to ensure coherence: invalidate, update, or supply the value.

[Figure: processors P0 ... Pn, each with a cache whose lines hold state, address, and data, plus memories, all attached to a shared memory bus; a memory operation from Pn is observed by the other caches via a bus snoop.]

Page 34: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

34

Limits of Snoopy Coherence

[Figure: several processors, each with its own cache, sharing a single bus to multiple memories.]

Assume:
A 4 GHz processor
=> 16 GB/s instruction bandwidth per processor (32-bit instructions)
=> 9.6 GB/s data bandwidth, at 30% load-stores of 8-byte elements
=> 25.6 GB/s total demand per processor without caches
Suppose a 98% instruction hit rate and a 90% data hit rate
=> 320 MB/s instruction bandwidth per processor on the bus
=> 960 MB/s data bandwidth per processor on the bus
=> 1.28 GB/s combined bus bandwidth per processor
Assuming 10 GB/s of bus bandwidth, 8 processors will saturate the bus.
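The same back-of-the-envelope arithmetic, as a small sketch (the clock rate, hit rates, and bus bandwidth are simply the assumed numbers above):

#include <stdio.h>

int main(void)
{
    double clock_ghz = 4.0;                   /* assumed clock rate */
    double inst_bw = clock_ghz * 4;           /* GB/s: one 32-bit instruction per cycle */
    double data_bw = clock_ghz * 0.30 * 8;    /* GB/s: 30% load-stores of 8-byte elements */
    double miss_bw = inst_bw * 0.02 + data_bw * 0.10;  /* 98% / 90% hit rates */
    double bus_bw  = 10.0;                    /* GB/s, assumed bus bandwidth */

    printf("demand per processor without caches: %.1f GB/s\n", inst_bw + data_bw);
    printf("bus traffic per processor with caches: %.2f GB/s\n", miss_bw);
    printf("processors needed to saturate the bus: %.1f\n", bus_bw / miss_bw);
    return 0;
}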

Page 35: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

35

Sample Machines

Intel Pentium Pro Quad: coherent, 4 processors.
Sun Enterprise server: coherent, up to 16 processor and/or memory-I/O cards.
[Figure: the Pentium Pro Quad, with P-Pro modules (CPU, bus interface, MIU, 256-KB L2 cache), an interrupt controller, a memory controller with 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI buses and I/O cards, all on a P-Pro bus (64-bit data, 36-bit address, 66 MHz); and the Sun Enterprise, with CPU/memory cards (two processors with their caches and a memory controller per card) and I/O cards (100bT, SCSI, 2 Fibre Channel) on a Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]

Page 36: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

40

Concurrency within a “Box”

Two main techniques:
SMP
Multi-core

Page 37: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

41

Moore’s Law

Moore's Law describes an important trend in the history of computer hardware: the number of transistors that can be inexpensively placed on an integrated circuit increases exponentially, doubling approximately every two years.

The observation was first made by Intel co-founder Gordon E. Moore in a 1965 paper.

The trend has continued for more than half a century and is not expected to stop for another decade at least.

Page 38: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

42

Moore's Law!
Many people interpret Moore's law as "computers get twice as fast every 18 months".
That is not technically true: the law is all about transistor density.
And the "twice as fast" reading no longer holds anyway: we should have 10 GHz processors right now, and we don't!

Page 39: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

43

No More Moore?
We are used to getting faster CPUs all the time, and we are used to them keeping up with ever more demanding software.
This is known as "Andy giveth, and Bill taketh away" (Andy Grove, Bill Gates).
It's a nice way to force people to buy computers often.
But basically, our computers get better, do more things, and it just happens automatically.
Some people call this the "performance free lunch".
Conventional wisdom: "Not to worry, tomorrow's processors will have even more throughput, and anyway today's applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they're often I/O-bound, network-bound, database-bound)."

Page 40: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

44

Commodity Improvements
There are three main ways in which commodity processors keep improving:
Higher clock rates
More aggressive instruction reordering and more concurrent units
Bigger/faster caches
All applications can easily benefit from these improvements, at the cost of perhaps a recompilation.
Unfortunately, the first two are hitting their limits: higher clock rates lead to excessive heat and power consumption, and there is no more room for instruction reordering without compromising correctness.

Page 41: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

45

Is Moore's Law No Longer True?
Ironically, Moore's law is still true: the density indeed still doubles.
But its wrong interpretation is not: clock rates no longer double.
We can't let performance stall, though: computers have to get more powerful.
Therefore, the industry has thought of a new way to improve them: multi-core, i.e., multiple CPUs on a single chip.
Multi-core adds another level of concurrency, but unlike, say, multiple functional units, it is hard for a compiler to exploit automatically.
Therefore, applications must be rewritten to benefit from the (nowadays expected) performance increase (see the sketch after this slide).
"Concurrency is the next major revolution in how we write software" (Dr. Dobb's Journal, 30(3), March 2005)
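As a concrete illustration of such a rewrite (a sketch added here, not from the slides), OpenMP is one common way to do it in C: a single pragma asks the compiler and runtime to spread a loop's iterations across the available cores.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    enum { N = 1000000 };
    static double a[N], b[N], c[N];

    /* Without the pragma this loop uses one core; with it, each core
       gets a chunk of the iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + 2.0 * b[i];

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g., gcc -fopenmp), the same source runs on one core or on many without further changes.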

Page 42: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

46

Multi-core Processors
In addition to putting concurrency in the public's eye, multi-core architectures will have a deep impact.
Languages will be forced to deal well with concurrency: new language designs? new language extensions? new compilers?
Efficiency and performance optimization will become more important: write code that is fast on one core with a limited clock rate.
The CPU may very well become a bottleneck (again) for single-core programs: other factors will improve, but not the clock rate.
Prediction: many companies will be hiring people to (re)write concurrent applications.

Page 43: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

47

Multi-Core

Quote from PC World Magazine Summer 2005:

“Don't expect dual-core to be the top performer today for games and other demanding single-threaded applications. But that will change as applications are rewritten. For example, by year's end, Unreal Tournament should have released a new game engine that takes advantage of dual-core processing.“

Page 44: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

48

Concurrency and Computers

We will see computer systems designed to allow concurrency (for performance benefits)

Concurrency occurs at many levels in computer systems: within a CPU, within a "box", and across boxes.

Page 45: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

49

Multiple Boxes Together
Example:
Take four "boxes", e.g., four Intel Itaniums bought from Dell.
Hook them up to a network, e.g., a switch bought from Cisco, Myricom, etc.
Install software that allows you to write/run applications that can utilize these four boxes concurrently (a sketch follows this slide).
This is a simple way to achieve concurrency across computer systems.
Everybody has heard of "clusters" by now: they are basically like the above example and can be purchased already built from vendors.
We will talk about this kind of concurrent platform at length during this class.
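The "software that allows you to write/run applications" across boxes is typically a message-passing library such as MPI. A minimal sketch, assuming an MPI implementation (e.g., MPICH or Open MPI) is installed on the four boxes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* one process per box (or per core) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes are there? */
    printf("hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Launched with something like mpirun -np 4 ./hello across the four nodes, each box runs one copy of the program and the copies cooperate by exchanging messages.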

Page 46: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

50

Multiple Boxes Together
Why do we use multiple boxes?
Every programmer would rather have an SMP/multi-core architecture that provides all the power/memory she/he needs.
The problem is that single boxes do not scale to meet the needs of many scientific applications: you can't have enough processors or powerful enough cores, and you can't have enough memory.
But if you can live with a single box, do it! We will see that single-box programming is much easier than multi-box programming.

Page 47: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

51

Where Does This Leave Us?
So far we have seen many ways in which concurrency can be achieved/implemented in computer systems: within a box and across boxes.
So we could look at a system and just list all the ways in which it does concurrency.
It would be nice to have a great taxonomy of parallel platforms in which we can pigeon-hole all (past and present) systems: it would provide simple names that everybody can use and understand quickly.

Page 48: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

52

Taxonomy of parallel machines?

It's not going to happen.
Just last year, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be; Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be, and proposing a new multi-dimensional scheme!
Both papers agree that most terms are conflated, misused, etc. (e.g., MPP).
Matters are complicated by the fact that concurrency appears at so many levels.
Example: a 16-node cluster, where each node is a 4-way multi-processor, where each processor is hyper-threaded, has vector units, and is fully pipelined with multiple, pipelined functional units.

Page 49: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

53

Taxonomy of Platforms?
We'll look at one traditional taxonomy.
We'll look at current categorizations from the Top500.
We'll look at examples of platforms.
We'll look at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture.

Page 50: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

54

The Flynn Taxonomy
Proposed in 1966!!!
A functional taxonomy based on the notion of streams of information: data and instructions.
Platforms are classified according to whether they have a single (S) or multiple (M) streams of each of the above.
Four possibilities:
SISD (sequential machine)
SIMD
MIMD
MISD (rare, no commercial system... systolic arrays)

Page 51: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

55

Taxonomy of Parallel Computers

Flynn’s taxonomy of parallel computers.

Page 52: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

56

SIMD

PEs can be deactivated and activated on the fly.
Vector processing (e.g., vector add) is easy to implement on SIMD.
Debate: is a vector processor a SIMD machine?
The two are often confused; strictly, it is not true according to the taxonomy (a vector processor is really SISD with pipelined operations), but it's convenient to think of the two as equivalent.

[Figure: a control unit fetches, decodes, and broadcasts a single stream of instructions to several processing elements.]

Page 53: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

57

MIMD
The most general category.
Pretty much every supercomputer in existence today is a MIMD machine at some level, which limits the usefulness of the taxonomy.
But you have to have heard of it at least once, because people keep referring to it, somehow...
Other taxonomies have been proposed, none very satisfying.
Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrid anyway.

Page 54: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

58

Taxonomy of Parallel Computers

A taxonomy of parallel computers.

Page 55: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

59

A host of parallel machines

There are (have been) many kinds of parallel machines

For the last 12 years, their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500 list

It is a good source of information about what machines are (were) and how they have evolved

Note that it’s really about “supercomputers”

http://www.top500.org

Page 56: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

61

What can we find on the Top500?

Page 57: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

62

Pies

Page 58: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

63

Top Ten Computers (http://www.top500.org)

Page 59: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

64

Top 500 Computers--Countries (http://www.top500.org)

Page 60: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

65

Top 500 Computers--Manufacturers (http://www.top500.org)

Page 61: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

66

Top 500 Computers—Manufacturers Trend (http://www.top500.org)

Page 62: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

67

Top 500 Computers--Operating Systems (http://www.top500.org)

Page 63: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

68

Top 500 Computers—Operating Systems Trend (http://www.top500.org)

Page 64: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

69

Top 500 Computers--Processors (http://www.top500.org)

Page 65: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

70

Top 500 Computers—Processors Trend (http://www.top500.org)

Page 66: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

71

Top 500 Computers--Customers (http://www.top500.org)

Page 67: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

72

Top 500 Computers—Customers Trend (http://www.top500.org)

Page 68: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

73

Top 500 Computers--Applications (http://www.top500.org)

Page 69: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

74

Top 500 Computers--Applications Trend (http://www.top500.org)

Page 70: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

75

Top 500 Computers—Architecture (http://www.top500.org)

Page 71: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

76

Top 500 Computers—Architecture Trend (http://www.top500.org)

Page 72: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

78

SMPs
"Symmetric MultiProcessors" (often mislabeled as "Shared-Memory Processors", which has now become tolerated).
Processors are all connected to a (large) memory.
UMA: Uniform Memory Access, which makes it easy to program.
Symmetric: all memory is equally close to all processors.
Difficult to scale to many processors (< 32 typically).
Cache coherence via "snoopy caches" or "directories".

[Figure: processors P1, P2, ..., Pn, each with a cache ($), connected by a network/bus to memory.]

Page 73: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

79

Distributed Shared Memory
Memory is logically shared, but physically distributed in banks.
Any processor can access any address in memory.
Cache lines (or pages) are passed around the machine.
Cache coherence: distributed directories.
NUMA: Non-Uniform Memory Access (some processors may be closer to some banks).
The SGI Origin2000 is a canonical example: it scales to 100s of processors and uses a hypercube topology for the memory (more on this later).

[Figure: processors P1, P2, ..., Pn, each with a cache ($), connected by a network to several memory banks.]

Page 74: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

80

Clusters, Constellations, MPPs
These are the only 3 categories today in the Top500.
They all belong to the distributed-memory model (MIMD), with many twists.
Each processor/node has its own memory and cache but cannot directly access another processor's memory (nodes may be SMPs).
Each "node" has a network interface (NI) for all communication and synchronization.
So what are these 3 categories?

[Figure: nodes P0, P1, ..., Pn, each with its own memory and a network interface (NI), connected by an interconnect.]

Page 75: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

81

Clusters
58.2% of the Top500 machines are labeled as "clusters".
Definition: a parallel computer system comprising an integrated collection of independent "nodes", each of which is a system in its own right, capable of independent operation, and derived from products developed and marketed for other standalone purposes.
A commodity cluster is one in which both the network and the compute nodes are available on the market.
In the Top500, "cluster" means "commodity cluster".
A well-known type of commodity cluster is the "Beowulf-class PC cluster", or "Beowulf".

Page 76: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

82

What is Beowulf?
An experiment in parallel computing systems.
It established a vision of low-cost, high-end computing with public-domain software (and led to software development).
Tutorials and a book describe best practice on how to build such platforms.
Today, by "Beowulf cluster" one means a commodity cluster that runs Linux and GNU-type software.
The project was initiated by T. Sterling and D. Becker at NASA in 1994.

Page 77: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

83

Constellations???
Commodity clusters that differ from the previous ones by the dominant level of parallelism.
Clusters consist of nodes, and nodes are typically SMPs.
If there are more processors in a node than nodes in the cluster, then we have a constellation.
Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP.
To be honest, this term is not very useful and not much used.

Page 78: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

84

MPP????????
Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs massively parallel?).
MPPs may use proprietary networks and vector processors, as opposed to commodity components.
The Cray T3E, Cray X1, and Earth Simulator are distributed-memory machines, but the nodes are SMPs.
Basically, everything that's fast and not commodity is an MPP, in terms of today's Top500.
Let's look at these "non-commodity" things (people's definitions of "commodity" vary).

Page 79: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

85

Vector Processors

Vector architectures were based on a single processor with multiple functional units, all performing the same operation.
Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel.
Historically important: overtaken by MPPs in the 90s, as seen in the Top500, but re-emerging in recent years:
At a large scale in the Earth Simulator (NEC SX6) and the Cray X1
At a small scale in SIMD media extensions to microprocessors: SSE, SSE2 (Intel: Pentium/IA64), AltiVec (IBM/Motorola/Apple: PowerPC), VIS (Sun: SPARC)
Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to.

Page 80: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

86

Vector Processors
Advantages:
Quick fetch and decode of a single instruction for multiple operations.
The instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion.
The compiler does the work for you, of course.
Memory-to-memory variants: no registers; they can process very long vectors, but startup time is large; they appeared in the 70s and died in the 80s.
Vendors: Cray, Fujitsu, Hitachi, NEC.

Page 81: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

87

Global Address Space

Examples: the Cray T3D, T3E, X1, and HP AlphaServer clusters.
The network interface supports "Remote Direct Memory Access": the NI can directly access memory without interrupting the CPU.
One processor can read/write another processor's memory with one-sided operations (put/get).
Not just a load/store as on a shared-memory machine; remote data is typically not cached locally.

[Figure: nodes P0, P1, ..., Pn, each with memory and a network interface (NI), connected by an interconnect.]
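At the programming level, MPI's one-sided operations expose the same put/get style. A minimal sketch for illustration (this is a software analogue of RDMA, not the hardware interface itself), run with at least two processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose one int per process as a remotely accessible window. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int data = 42;
        /* One-sided put: write into process 1's window without process 1
           posting a matching receive. */
        MPI_Put(&data, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("process 1 received %d via a one-sided put\n", value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}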

Page 82: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

95

Blue Gene/L
65,536 processors.
Relatively modest clock rates, so that power consumption is low, cooling is easy, and space is small (1024 nodes in the same rack).
Besides, processor speed is on par with the memory speed, so a faster clock rate would not help.
2-way SMP nodes (really different from the X1).
Several networks:
A 64x32x32 3-D torus for point-to-point communication
A tree for collective operations and for I/O
Plus others: Ethernet, etc.

Page 83: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

96

BlueGene

The BlueGene/L custom processor chip.

Page 84: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

97

BlueGene
The BlueGene/L. (a) Chip. (b) Card. (c) Board.

(d) Cabinet. (e) System.

Page 85: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

100

If You Like Dead Supercomputers
Lots of old supercomputers with pictures: http://www.geocities.com/Athens/6270/superp.html
Dead Supercomputers: http://www.paralogos.com/DeadSuper/Projects.html
On eBay: a Cray Y-MP/C90 (1993) sold for $45,100.70, from the Pittsburgh Supercomputing Center, which wanted to get rid of it to make space in its machine room.
Original cost: $35,000,000. Weight: 30 tons.
It cost $400,000 to make it work at the buyer's ranch in Northern California.

Page 86: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

101

Network Topologies
People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines.
Examples include:
Ring: KSR (1991)
2-D grid: Intel Paragon (1992)
Torus
Hypercube: nCube, Intel iPSC/860; also used in the SGI Origin 2000 for memory
Fat-tree: IBM Colony and Federation interconnects (SP-x)
Arrangements of switches, pioneered with "butterfly networks" like the BBN TC2000 in the early 1990s: 200 MHz processors in a multi-stage network of switches, with virtually shared distributed memory (NUMA). I actually worked with that one!

Page 87: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

102

Hypercube

Defined by its dimension, d

[Figure: hypercubes of dimension 1D, 2D, 3D, and 4D.]

Page 88: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

103

Hypercube

Properties:
Has 2^d nodes
The number of hops between two nodes is at most d
The diameter of the network grows logarithmically with the number of nodes, which was the key to the interest in hypercubes
But each node needs d neighbors, which is a problem
Routing and addressing:

[Figure: a 4-D hypercube with nodes labeled by 4-bit addresses 0000 through 1111; adjacent nodes differ in exactly one bit.]

Each node has a d-bit address. Routing from xxxx to yyyy: just keep going to a neighbor that has a smaller Hamming distance to the destination (reminiscent of some p2p schemes). A sketch of this rule follows this slide.

TONS of Hypercube research (even today!!)
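A small sketch of that routing rule in C (illustrative only): each step flips one bit in which the current address and the destination differ, so the Hamming distance to the destination strictly decreases and the route takes at most d hops.

#include <stdio.h>

/* One routing step on a d-dimensional hypercube: flip the lowest-order
   bit in which the current node's address and the destination differ. */
unsigned next_hop(unsigned current, unsigned dest)
{
    unsigned diff = current ^ dest;
    if (diff == 0)
        return current;               /* already at the destination */
    return current ^ (diff & -diff);  /* flip the lowest differing bit */
}

int main(void)
{
    unsigned node = 0x5 /* 0101 */, dest = 0xE /* 1110 */;
    while (node != dest) {
        printf("%x -> ", node);
        node = next_hop(node, dest);
    }
    printf("%x\n", node);   /* route: 5 -> 4 -> 6 -> e, three hops */
    return 0;
}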

Page 89: Instructor: S. Masoud Sadjadi cs.fiu/~sadjadi/Teaching

104

Conclusion
Concurrency appears at all levels, both in "commodity systems" and in "supercomputers" (the distinction is rather annoying).
When needing performance, one has to exploit concurrency to the best of the platform's capabilities:
e.g., as a developer of a geophysics application to run on a 10,000-processor heavy-iron supercomputer at the Sandia national lab
e.g., as a game developer on an 8-way multi-core, hyper-threaded desktop system sold by Dell
In this course we'll gain a hands-on understanding of how to write concurrent/parallel software:
Using the GCB and MIND clusters
Using the LA Grid and the Open Science Grid