
Lecture 2 – Parallel Architecture

Parallel Computer Architecture

Introduction to Parallel Computing CIS 410/510

Department of Computer and Information Science


Outline
• Parallel architecture types
• Instruction-level parallelism
• Vector processing
• SIMD
• Shared memory
  – Memory organization: UMA, NUMA
  – Coherency: CC-UMA, CC-NUMA
• Interconnection networks
• Distributed memory
• Clusters
• Clusters of SMPs
• Heterogeneous clusters of SMPs


Parallel Architecture Types
• Uniprocessor
  – Scalar processor
  – Vector processor
  – Single Instruction Multiple Data (SIMD)
• Shared Memory Multiprocessor (SMP)
  – Shared memory address space
  – Bus-based memory system
  – Interconnection network
[Figure: block diagrams of each type: a scalar processor with memory, a vector processor with memory, processors sharing memory over a bus, and processors sharing memory through a network]


Parallel Architecture Types (2)
• Distributed Memory Multiprocessor
  – Message passing between nodes
  – Massively Parallel Processor (MPP)
    ◆ Many, many processors
• Cluster of SMPs
  – Shared memory addressing within SMP node
  – Message passing between SMP nodes
  – Can also be regarded as MPP if processor number is large
[Figure: distributed-memory nodes (processor + memory) on an interconnection network, and SMP nodes (several processors P sharing memory M) joined through network interfaces to an interconnection network]


Parallel Architecture Types (3)
• Multicore SMP+GPU Cluster
  – Shared memory addressing within SMP node
  – Message passing between SMP nodes
  – GPU accelerators attached
• Multicore
  – Multicore processor
  – GPU accelerator
  – “Fused” processor/accelerator
[Figure: multicore processors (cores C with caches m) sharing node memory, GPU accelerators attached over PCI, and SMP nodes joined by an interconnection network; cores can be hardware multithreaded (hyperthreaded)]


How do you get parallelism in the hardware?
• Instruction-Level Parallelism (ILP)
• Data parallelism
  – Increase amount of data to be operated on at the same time
• Processor parallelism
  – Increase number of processors
• Memory system parallelism
  – Increase number of memory units
  – Increase bandwidth to memory
• Communication parallelism
  – Increase amount of interconnection between elements
  – Increase communication bandwidth


Instruction-Level Parallelism
• Opportunities for splitting up instruction processing
• Pipelining within instruction
• Pipelining between instructions
• Overlapped execution
• Multiple functional units
• Out-of-order execution
• Multi-issue execution
• Superscalar processing
• Superpipelining
• Very Long Instruction Word (VLIW)
• Hardware multithreading (hyperthreading)


Parallelism in Single Processor Computers
• History of processor architecture innovation
[Figure: innovation tree from unpipelined machines with only scalar instructions to pipelined designs: ILP with issue-when-ready scheduling (CDC 6600 scoreboarding, IBM 360/91 reservation stations), multiple execution units (CDC 7600), vector instructions in register-to-register (CRAY-1) and memory-to-memory (CDC Cyber-205) styles, and VLIW / horizontal control (FPS AP-120B)]


Vector Processing
• Scalar processing
  – Processor instructions operate on scalar values
  – Integer registers and floating point registers
• Vectors
  – Set of scalar data
  – Vector registers
    ◆ Integer, floating point (typically)
  – Vector instructions operate on vector registers (SIMD) – see the loop sketch below
• Vector unit pipelining
• Multiple vector units
• Vector chaining
[Photo: Cray 2 – liquid-cooled with inert fluorocarbon (that’s a waterfall fountain!!!)]
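To make the vector/SIMD idea concrete, the loop below applies one scalar operation across many independent elements, which is exactly the pattern a vectorizing compiler (or vector hardware) maps onto vector registers. This is a minimal sketch, not from the original slides; the function name and sizes are illustrative.

    #include <stdio.h>

    /* y[i] = a*x[i] + y[i]: one operation repeated over independent
       elements, the pattern vector/SIMD hardware accelerates. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)      /* candidate for vectorization */
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float y[8] = {0};
        saxpy(8, 2.0f, x, y);
        printf("y[7] = %g\n", y[7]);     /* prints 16 */
        return 0;
    }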


Data Parallel Architectures
• SIMD (Single Instruction Multiple Data)
  – Logical single thread (instruction) of control
  – Processor associated with data elements
• Architecture
  – Array of simple processors with memory
  – Processors arranged in a regular topology
  – Control processor issues instructions
    ◆ All processors execute same instruction (maybe disabled)
  – Specialized synchronization and communication
  – Specialized reduction operations
  – Array processing
[Figure: control processor broadcasting instructions to a regular grid of processing elements (PE)]


AMT DAP 500
• Active Memory Technology (AMT)
• Distributed Array Processor (DAP)
[Figure: DAP 500 organization: host connection unit and master control unit (with code memory and user interface) driving a 32×32 array of processor elements, each with accumulator (A), carry (C), data (D), and activity-control registers, array memory (32K bits), and a fast data channel]


Thinking Machines Connection Machine (Tucker, IEEE Computer, Aug. 1988)
• 16,000 processors!!!


Vector and SIMD Processing Timeline
[Figure: two development tracks]
(a) Multivector track: CDC 7600 (CDC, 1970), Cray 1 (Russell, 1978), CDC Cyber 205 (Levine, 1982), ETA 10 (ETA, Inc., 1989), Cray Y-MP (Cray Research, 1989), Cray/MPP (Cray Research, 1993), Fujitsu, NEC, Hitachi models
(b) SIMD track: Illiac IV (Barnes et al., 1968), Goodyear MPP (Batcher, 1980), BSP (Kuck and Stokes, 1982), IBM GF/11 (Beetem et al., 1985), DAP 610 (AMT, Inc., 1987), CM2 (TMC, 1990), MasPar MP1 (Nickolls, 1990), MasPar MP2 (1991)


What’s the maximum parallelism in a program?
• “MaxPar: An Execution Driven Simulator for Studying Parallel Systems,” Ding-Kai Chen, M.S. Thesis, University of Illinois, Urbana-Champaign, 1989.
• Analyze the data dependencies in application execution
[Figure: measured parallelism profiles for a 512-point FFT and Flo52]


Dataflow Architectures
• Represent computation as graph of dependencies
• Operations stored in memory until operands are ready
• Operations can be dispatched to processors
• Tokens carry tags of next instruction to processor
• Tag compared in matching store
• A match fires execution
• Machine does the hard parallelization work
• Hard to build correctly
[Figure: dataflow graph for a = (b + 1) × (b − c), d = c × e, f = a × d, and a dataflow processor pipeline: token queue → waiting/matching store → instruction fetch from program store → execute → form token → network]


Shared Physical Memory
• Add processors to single processor computer system
• Processors share computer system resources
  – Memory, storage, …
• Sharing physical memory
  – Any processor can reference any memory location
  – Any I/O controller can reference any memory address
  – Single physical memory address space
• Operating system runs on any processor, or all
  – OS sees single memory address space
  – Uses shared memory to coordinate
• Communication occurs as a result of loads and stores (see the sketch below)
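A minimal sketch (not from the slides) of what communication through loads and stores looks like to the programmer: one thread stores into an ordinary shared variable and another thread loads it, with the thread join providing the ordering between the two.

    #include <pthread.h>
    #include <stdio.h>

    static int shared_value;              /* ordinary memory, visible to every thread */

    static void *producer(void *arg) {
        shared_value = 42;                /* "communication" is just a store ... */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);            /* join orders the store before the load below */
        printf("%d\n", shared_value);     /* ... and a matching load */
        return 0;
    }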


Caching in Shared Memory Systems
• Reduce average latency
  – Automatic replication closer to processor
• Reduce average bandwidth
• Data is logically transferred from producer to consumer to memory
  – store reg → mem
  – load reg ← mem
• Processors can share data efficiently
• What happens when store and load are executed on different processors?
• Cache coherence problems
[Figure: processors P, each with a cache, sharing memory through an interconnect]


Shared Memory Multiprocessors (SMP)
• Architecture types
• Differences lie in memory system interconnection
[Figure: a single processor with memory; multiple processors sharing memory through a multi-port memory, a shared bus, or an interconnection network; and a generic organization with processors, memory modules, and I/O controllers on an interconnect. What does this look like?]


Bus-based SMP
• Memory bus handles all memory read/write traffic
• Processors share bus
• Uniform Memory Access (UMA)
  – Memory (not cache) uniformly equidistant
  – Accesses take the same amount of time (generally) to complete
• May have multiple memory modules
  – Interleaving of physical address space
• Caches introduce memory hierarchy
  – Lead to data consistency problems
  – Cache coherency hardware necessary (CC-UMA)
[Figure: processors with caches ($) and I/O controllers on a shared bus to memory modules M]


Crossbar SMP
• Replicates memory bus for every processor and I/O controller
  – Every processor has direct path
• UMA SMP architecture
• Can still have cache coherency issues
• Multi-bank memory or interleaved memory
• Advantages
  – Bandwidth scales linearly (no shared links)
• Problems
  – High incremental cost (cannot afford for many processors)
  – Use switched multi-stage interconnection network
[Figure: crossbar connecting processors (with caches C) and I/O controllers to multiple memory banks M]


“Dance Hall” SMP and Shared Cache
• Interconnection network connects processors to memory
• Centralized memory (UMA)
• Network determines performance
  – Continuum from bus to crossbar
  – Scalable memory bandwidth
• Memory is physically separated from processors
• Could have cache coherence problems
• Shared cache reduces coherence problem and provides fine grained data sharing
[Figure: “dance hall” organization with processors and caches on one side of the network and memories on the other; shared-cache organization with processors P1…Pn connected through a switch to an interleaved first-level cache and interleaved main memory]


University of Illinois CSRD Cedar Machine
• Center for Supercomputing Research and Development
• Multi-cluster scalable parallel computer
• Alliant FX/80
  – 8 processors w/ vectors
  – Shared cache
  – HW synchronization
• Omega switching network
• Shared global memory
• SW-based global memory coherency


Natural Extensions of the Memory System
[Figure: three organizations, in order of increasing scale: Shared Cache (processors P1…Pn through a switch to an interleaved first-level cache and interleaved main memory); Centralized Memory “Dance Hall,” UMA (processors with private caches and memory modules on opposite sides of a crossbar or interleaved interconnection network); Distributed Memory, NUMA (each processor with its cache and local memory attached directly to the interconnection network)]


Non-Uniform Memory Access (NUMA) SMPs
• Distributed memory
• Memory is physically resident close to each processor
• Memory is still shared
• Non-Uniform Memory Access (NUMA)
  – Local memory and remote memory
  – Access to local memory is faster, remote memory slower
  – Access is non-uniform
  – Performance will depend on data locality
• Cache coherency is still an issue (more serious)
• Interconnection network architecture is more scalable
[Figure: nodes, each with processor, cache, and local memory M, connected by a network]


Cache Coherency and SMPs
• Caches play key role in SMP performance
  – Reduce average data access time
  – Reduce bandwidth demands placed on shared interconnect
• Private processor caches create a problem
  – Copies of a variable can be present in multiple caches
  – A write by one processor may not become visible to others
    ◆ They’ll keep accessing stale value in their caches
  ⇒ Cache coherence problem
• What do we do about it?
  – Organize the memory hierarchy to make it go away
  – Detect and take actions to eliminate the problem


Definitions
• Memory operation (load, store, read-modify-write, …)
• Memory issue is operation presented to memory system
• Processor perspective
  – Write: subsequent reads return the value
  – Read: subsequent writes cannot affect the value
• Coherent memory system
  – There exists a serial order of memory operations on each location such that
    ◆ Operations issued by a process appear in order issued
    ◆ Value returned by each read is that written by previous write
  ⇒ Write propagation + write serialization


Motivation for Memory Consistency
• Coherence implies that writes to a location become visible to all processors in the same order
• But when does a write become visible?
• How do we establish orders between a write and a read by different processors?
  – Use event synchronization
• Implement hardware protocol for cache coherency
• Protocol will be based on model of memory consistency

    /* Assume initial value of A and flag is 0 */
    P1:                          P2:
    A = 1;                       while (flag == 0);  /* spin idly */
    flag = 1;                    print A;
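A hedged illustration (not from the slides) of the same producer/consumer pattern written with C11 atomics: the release store and acquire load are what establish the order between P1's write of A and P2's read, so the print is guaranteed to show 1. Assumes a C11 compiler that provides <threads.h>.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    int A = 0;                       /* ordinary data */
    atomic_int flag = 0;             /* synchronization flag */

    int p1(void *arg) {
        A = 1;
        /* release: the earlier write of A becomes visible to any
           thread that later acquires the same flag */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return 0;
    }

    int p2(void *arg) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                        /* spin idly */
        printf("A = %d\n", A);       /* prints 1 */
        return 0;
    }

    int main(void) {
        thrd_t t1, t2;
        thrd_create(&t1, p1, NULL);
        thrd_create(&t2, p2, NULL);
        thrd_join(t1, NULL);
        thrd_join(t2, NULL);
        return 0;
    }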


Memory Consistency
• Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to each other
  – What orders are preserved?
  – Given a load, constrains the possible values returned by it
• Implications for both programmer and system designer
  – Programmer uses it to reason about correctness
  – System designer can use it to constrain how much accesses can be reordered by compiler or hardware
• Contract between programmer and system


Sequential Consistency
• Total order achieved by interleaving accesses from different processes
  – Maintains program order
  – Memory operations (from all processes) appear to issue, execute, and complete atomically with respect to others
  – As if there was a single memory (no cache)

“A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979]
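A standard litmus test (added here for illustration, not from the slides) makes Lamport's definition concrete: under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible, because in any interleaving that preserves each thread's program order the later of the two reads must see the other thread's earlier write. Weaker real-world hardware and compiler models do allow it.

    #include <stdio.h>
    #include <threads.h>

    /* Store-buffering litmus test; the data race is deliberate, since the
       point is which outcomes a memory model permits. */
    int X = 0, Y = 0;
    int r1, r2;

    int t1(void *arg) { X = 1; r1 = Y; return 0; }   /* write X, then read Y */
    int t2(void *arg) { Y = 1; r2 = X; return 0; }   /* write Y, then read X */

    int main(void) {
        thrd_t a, b;
        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        printf("r1=%d r2=%d\n", r1, r2);   /* SC forbids r1=0 r2=0 */
        return 0;
    }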


Sequential Consistency (Sufficient Conditions)
• There exists a total order consistent with the memory operations becoming visible in program order
• Sufficient conditions
  – Every process issues memory operations in program order
  – After a write operation is issued, the issuing process waits for the write to complete before issuing its next memory operation (atomic writes)
  – After a read is issued, the issuing process waits for the read to complete and for the write whose value is being returned to complete (globally) before issuing its next memory operation
• Cache-coherent architectures implement consistency


Bus-based Cache-Coherent (CC) Architecture
• Bus transactions
  – Single set of wires connects several devices
  – Bus protocol: arbitration, command/addr, data
  – Every device observes every transaction
• Cache block state transition diagram
  – FSM specifying how disposition of block changes
    ◆ Invalid, valid, dirty
  – Snoopy protocol
• Basic choices
  – Write-through vs. write-back
  – Invalidate vs. update


Snoopy Cache-Coherency Protocols
• Bus is a broadcast medium
• Caches know what they have
• Cache controller “snoops” all transactions on shared bus
  – Relevant transaction if it is for a block its cache contains
  – Take action to ensure coherence
    ◆ Invalidate, update, or supply value
  – Depends on state of the block and the protocol
[Figure: processors P1…Pn with caches (each line holding state, address, data) and memory with I/O devices on a shared bus; each cache controller snoops bus transactions and performs cache-memory transactions]


Example: Write-back Invalidate
[Figure: three processors P1, P2, P3 with private write-back caches on a bus to memory and I/O devices; memory initially holds u:5. (1) P1 reads u and caches 5, (2) P3 reads u and caches 5, (3) P3 writes u = 7 into its cache, (4) P1 reads u = ? (it may see its stale cached 5), (5) P2 reads u = ? (memory still supplies the old 5) – without a coherence protocol the processors can observe inconsistent values]


Intel Pentium Pro Quad Processor
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
[Figure: four P-Pro modules (CPU, 256-KB L2 $, bus interface, interrupt controller) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller (MIU) to 1-, 2-, or 4-way interleaved DRAM and PCI bridges to PCI I/O cards]


Natural Extensions of the Memory System
[Figure: the same spectrum again, ordered by scale: Shared Cache (P1…Pn through a switch to an interleaved first-level cache and interleaved main memory); Centralized Memory “Dance Hall,” UMA (crossbar or interleaved interconnection network between caches and memories); Distributed Shared Memory, NUMA (each processor with its cache and local memory on the interconnection network)]


Memory Consistency
• Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to each other
  – What orders are preserved?
  – Given a load, constrains the possible values returned by it
• Implications for both programmer and system designer
  – Programmer uses it to reason about correctness
  – System designer can use it to constrain how much accesses can be reordered by compiler or hardware
• Contract between programmer and system
• Need coherency systems to enforce memory consistency


Context for Scalable Cache Coherence
• Caches naturally replicate data
  – Coherence through bus snooping protocols
  – Consistency
• Scalable networks
  – Many simultaneous transactions
• Scalable distributed memory
• Realizing programming models through network transaction protocols
  – Efficient node-to-net interface
  – Interprets transactions
• Need cache coherence protocols that scale!
  – No broadcast or single point of order
[Figure: nodes (processor, cache, memory, communication assist CA) attached through switches to a scalable network]


Generic Solution: Directories
• Maintain state vector explicitly
  – Associate with memory block
  – Records state of block in each cache
• On miss, communicate with directory
  – Determine location of cached copies
  – Determine action to take
  – Conduct protocol to maintain coherence
[Figure: nodes with processor, cache, communication assist, and memory plus directory, connected by a scalable interconnection network]


Requirements of a Cache Coherent System
• Provide set of states, state transition diagram, and actions
• Manage coherence protocol
  – (0) Determine when to invoke coherence protocol
  – (a) Find info about state of block in other caches to determine action
    ◆ Whether need to communicate with other cached copies
  – (b) Locate the other copies
  – (c) Communicate with those copies (inval/update)
• (0) is done the same way on all systems
  – State of the line is maintained in the cache
  – Protocol is invoked if an “access fault” occurs on the line
• Different approaches distinguished by (a) to (c)


Bus-based Cache Coherence
• All of (a), (b), (c) done through broadcast on bus
  – Faulting processor sends out a “search”
  – Others respond to the search probe and take necessary action
• Could do it in scalable network too
• Conceptually simple, but broadcast doesn’t scale
  – On a bus, bus bandwidth doesn’t scale
  – On a scalable network, every fault leads to at least p network transactions
• Scalable coherence:
  – Can have same cache states and state transition diagram
  – Different mechanisms to manage protocol


Basic Snoop Protocols
• Write Invalidate:
  – Multiple readers, single writer
  – Write to shared data: an invalidate is sent to all caches
  – Read miss:
    ◆ Write-through: memory is always up-to-date
    ◆ Write-back: snoop in caches to find most recent copy
• Write Broadcast (typically write-through):
  – Write to shared data: broadcast on bus, caches snoop and update
• Write serialization: bus serializes requests!
• Write Invalidate versus Broadcast


Snooping Cache Variations
• Basic Protocol: Exclusive, Shared, Invalid
• Berkeley Protocol: Owned Exclusive, Owned Shared, Shared, Invalid
  – Owner can update via bus invalidate operation
  – Owner must write back when replaced in cache
• Illinois Protocol: Private Dirty, Private Clean, Shared, Invalid
  – If read sourced from memory, then Private Clean; if read sourced from other cache, then Shared
  – Can write in cache if held Private Clean or Dirty
• MESI Protocol: Modified (private, != Memory), Exclusive (private, = Memory), Shared (shared, = Memory), Invalid
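The states map naturally onto a small controller FSM. The sketch below is illustrative only (not the slides' or any particular machine's implementation) and shows how a MESI line's state might change on the local processor's own reads and writes; transitions triggered by snooped bus transactions would be handled analogously.

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;   /* MESI states */

    /* State after the local processor reads the line.  On a miss,
       'others_have_copy' would come from the bus snoop result. */
    mesi_t on_local_read(mesi_t s, int others_have_copy) {
        if (s == INVALID)
            return others_have_copy ? SHARED : EXCLUSIVE;   /* read miss */
        return s;                                           /* M, E, S: hit */
    }

    /* State after the local processor writes the line.  From S or I an
       invalidate / read-exclusive goes on the bus first; the writer
       always ends up Modified. */
    mesi_t on_local_write(mesi_t s) {
        (void)s;
        return MODIFIED;
    }

    int main(void) {
        mesi_t line = INVALID;
        line = on_local_read(line, 0);   /* miss, no other copies -> EXCLUSIVE */
        line = on_local_write(line);     /* silent upgrade -> MODIFIED */
        printf("final state: %d\n", line);
        return 0;
    }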


Scalable Approach: Directories
• Every memory block has associated directory information
  – Keeps track of copies of cached blocks and their states
  – On a miss, find directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  – In scalable networks, communication with directory and copies is through network transactions
• Many alternatives for organizing directory information


Basic Operation of Directory
• k processors
• Each cache block in memory
  – k presence bits and 1 dirty bit
• Each cache block in cache
  – 1 valid bit and 1 dirty (owner) bit
• Read from memory
  – Dirty bit OFF
  – Dirty bit ON
• Write to memory
  – Dirty bit OFF
[Figure: k processors with caches on an interconnection network; memory holds a directory entry (presence bits + dirty bit) for every block]
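A minimal sketch of what such a directory entry and its read-miss handling could look like, assuming up to 64 processors and a full bit-vector directory; the names and structure are illustrative, not taken from the slides.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define K 64                       /* assumed number of processors */

    /* One directory entry per memory block: K presence bits + 1 dirty bit. */
    typedef struct {
        uint64_t presence;             /* bit i set => cache i holds a copy */
        bool     dirty;                /* true => exactly one cache owns the block */
    } dir_entry_t;

    /* Outline of a read miss from processor p arriving at the home node. */
    void handle_read_miss(dir_entry_t *e, int p) {
        if (e->dirty) {
            /* Fetch the block from the owner, write it back to memory,
               then owner and requester both remain/become sharers. */
            e->dirty = false;
        }
        e->presence |= (uint64_t)1 << p;   /* record the new sharer, reply with data */
    }

    int main(void) {
        dir_entry_t entry = { 0, false };
        handle_read_miss(&entry, 3);
        printf("presence = %#llx, dirty = %d\n",
               (unsigned long long)entry.presence, entry.dirty);
        return 0;
    }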


DASH Cache-Coherent SMP
• Directory Architecture for Shared Memory
• Stanford research project (early 1990s) for studying how to build cache-coherent shared memory architectures
• Directory-based cache coherency
• D. Lenoski et al., “The Stanford Dash Multiprocessor,” IEEE Computer, Volume 25, Issue 3, pp. 63–79, March 1992


Sequent NUMA-Q
• Ring-based SCI network
  – 1 GB/second
  – Built-in coherency
• Commodity SMPs as building blocks
  – Extend coherency mechanism
• Split transaction bus
[Figure: “quad” building blocks, each with four P6 processors, memory, PCI I/O, and an IQ-Link interface, joined in an SCI ring along with I/O devices and a management and diagnostic controller]


SGI Origin 2000
• Scalable shared memory multiprocessor
• MIPS R10000 CPU
• NUMAlink router
• Directory-based cache coherency (MESI)
• ASCI Blue Mountain
[Figure: node with two processors (1-4 MB L2 cache each) on a SysAD bus to a Hub chip, which connects main memory (1-4 GB) and its directory to the interconnection network]


Distributed Memory Multiprocessors
• Each processor has a local memory
  – Physically separated memory address space
• Processors must communicate to access non-local data
  – Message communication (message passing)
    ◆ Message passing architecture
  – Processor interconnection network
• Parallel applications must be partitioned across
  – Processors: execution units
  – Memory: data partitioning
• Scalable architecture
  – Small incremental cost to add hardware (cost of node)


Distributed Memory (MP) Architecture
• Nodes are complete computer systems
  – Including I/O
• Nodes communicate via interconnection network
  – Standard networks
  – Specialized networks
• Network interfaces
  – Communication integration
• Easier to build
[Figure: nodes (processor, cache, memory) attached to the network through network interfaces]


Network Performance Measures


Performance Metrics: Latency and Bandwidth
• Bandwidth
  – Need high bandwidth in communication
  – Match limits in network, memory, and processor
  – Network interface speed vs. network bisection bandwidth
• Latency
  – Performance affected since processor may have to wait
  – Harder to overlap communication and computation
  – Overhead to communicate is a problem in many machines
• Latency hiding
  – Increases programming system burden
  – Examples: communication/computation overlap, prefetch


Scalable, High-Performance Interconnect
• Interconnection network is core of parallel architecture
• Requirements and tradeoffs at many levels
  – Elegant mathematical structure
  – Deep relationship to algorithm structure
  – Hardware design sophistication
• Little consensus
  – Performance metrics?
  – Cost metrics?
  – Workload?
  – …


What Characterizes an Interconnection Network?
• Topology (what)
  – Interconnection structure of the network graph
• Routing Algorithm (which)
  – Restricts the set of paths that messages may follow
  – Many algorithms with different properties
• Switching Strategy (how)
  – How data in a message traverses a route
  – Circuit switching vs. packet switching
• Flow Control Mechanism (when)
  – When a message or portions of it traverse a route
  – What happens when traffic is encountered?


Topological Properties
• Routing distance
  – Number of links on route from source to destination
• Diameter
  – Maximum routing distance
• Average distance
• Partitioned network
  – Removal of links resulting in disconnected graph
  – Minimal cut
• Scaling increment
  – What is needed to grow the network to next valid degree


Interconnection Network Types

Topology       Degree      Diameter         Ave Dist       Bisection    D (D ave) @ P=1024
1D Array       2           N-1              N/3            1            huge
1D Ring        2           N/2              N/4            2
2D Mesh        4           2 (N^1/2 - 1)    2/3 N^1/2      N^1/2        63 (21)
2D Torus       4           N^1/2            1/2 N^1/2      2 N^1/2      32 (16)
k-ary n-cube   2n          nk/2             nk/4           nk/4         15 (7.5) @ n=3
Hypercube      n = log N   n                n/2            N/2          10 (5)

N = # nodes


Communication Performance
• Time(n)_s-d = overhead + routing delay + channel occupancy + contention delay
• occupancy = (n + n_h) / b
  – n = message #bytes
  – n_h = header #bytes
  – b = bitrate of communication link
• What is the routing delay?
• Does contention occur and what is the cost?
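A small sketch of the model as executable arithmetic; the contention-free numbers plugged in below (1 KB payload, 16-byte header, 1 GB/s link, 1 us overhead, 100 ns routing delay) are purely illustrative and not from the slides.

    #include <stdio.h>

    /* Channel occupancy: (n + nh) / b, with n and nh in bytes and b in bytes/s. */
    double occupancy(double n, double nh, double b) {
        return (n + nh) / b;
    }

    /* End-to-end time per the slide's model; routing delay and contention
       depend on the machine and on traffic, so they are passed in. */
    double comm_time(double overhead, double routing_delay,
                     double occ, double contention) {
        return overhead + routing_delay + occ + contention;
    }

    int main(void) {
        double occ = occupancy(1024.0, 16.0, 1e9);
        printf("occupancy = %.3g s, total = %.3g s\n",
               occ, comm_time(1e-6, 100e-9, occ, 0.0));
        return 0;
    }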


Store-and-Forward vs. Cut-Through Routing
• Store-and-forward latency over h hops: h (n/b + Δ); cut-through latency: n/b + h Δ
  (n = message size in bytes, b = link bandwidth, Δ = per-hop routing/switch delay)
• What if message is fragmented?
• Wormhole vs. Virtual cut-through
[Figure: time-space diagram of a four-flit packet (flits 0-3) traveling from source to destination; with store-and-forward routing each switch buffers the whole packet before forwarding it, while with cut-through routing the flits follow the header through each switch as soon as it advances]
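The two expressions compared numerically; as above, the concrete numbers (1 KB message, 1 GB/s links, 8 hops, 100 ns per switch) are illustrative assumptions, not values from the lecture.

    #include <stdio.h>

    /* Unloaded latency of an n-byte message over h hops on links of
       b bytes/second with per-switch delay delta (seconds). */
    double store_and_forward(double n, double b, double h, double delta) {
        return h * (n / b + delta);    /* every hop waits for the whole message */
    }

    double cut_through(double n, double b, double h, double delta) {
        return n / b + h * delta;      /* only the header pays the per-hop cost */
    }

    int main(void) {
        double n = 1024.0, b = 1e9, h = 8.0, d = 100e-9;
        printf("store-and-forward: %.3g s\n", store_and_forward(n, b, h, d));
        printf("cut-through:       %.3g s\n", cut_through(n, b, h, d));
        return 0;
    }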


Networks of Real Machines (circa 2000)

Machine    Topology    Speed     Width   Delay      Flit
nCUBE/2    hypercube   25 ns     1       40 cycles  32
CM-5       fat-tree    25 ns     4       10 cycles  4
SP-2       banyan      25 ns     8       5 cycles   16
Paragon    2D mesh     11.5 ns   16      2 cycles   16
T3D        3D torus    6.67 ns   16      2 cycles   16
DASH       torus       30 ns     16      2 cycles   16
Origin     hypercube   2.5 ns    20      16 cycles  160
Myrinet    arbitrary   6.25 ns   16      50 cycles  16

A message is broken up into flits for transfer.


Message Passing Model
• Hardware maintains send and receive message buffers
• Send message (synchronous)
  – Build message in local message send buffer
  – Specify receive location (processor id)
  – Initiate send and wait for receive acknowledge
• Receive message (synchronous)
  – Allocate local message receive buffer
  – Receive message byte stream into buffer
  – Verify message (e.g., checksum) and send acknowledge
• Memory-to-memory copy with acknowledgement and pairwise synchronization
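In practice this paired send/receive is what message-passing libraries expose; MPI (mentioned later in the lecture for Red Storm and BG/L) is the usual one. A minimal two-rank sketch:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = 0;
        if (rank == 0) {
            value = 42;                                  /* build message in local buffer */
            MPI_Send(&value, 1, MPI_INT, 1, 0,
                     MPI_COMM_WORLD);                    /* blocking send to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* blocking receive from rank 0 */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }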


Advantages of Shared Memory Architectures
• Compatibility with SMP hardware
• Ease of programming when communication patterns are complex or vary dynamically during execution
• Ability to develop applications using familiar SMP model, attention only on performance critical accesses
• Lower communication overhead, better use of BW for small items, due to implicit communication and memory mapping to implement protection in hardware, rather than through I/O system
• HW-controlled caching to reduce remote communication by caching of all data, both shared and private


Advantages of Distributed Memory Architectures
• The hardware can be simpler (especially versus NUMA) and is more scalable
• Communication is explicit and simpler to understand
• Explicit communication focuses attention on costly aspect of parallel computation
• Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization
• Easier to use sender-initiated communication, which may have some advantages in performance


Clusters of SMPs
• Clustering
  – Integrated packaging of nodes
• Motivation
  – Amortize node costs by sharing packaging and resources
  – Reduce network costs
  – Reduce communications bandwidth requirements
  – Reduce overall latency
  – More parallelism in a smaller space
  – Increase node performance
• Scalable parallel systems today are built as SMP clusters
[Figure: SMP nodes (multiple processors P sharing memory M) connected by an interconnection network]


CalTech Cosmic Cube
• First distributed memory message passing system
• Hypercube-based communications network
• Chuck Seitz, Geoffrey Fox
• Compute node FIFO on each link – store and forward
[Figure: 3-dimensional hypercube with nodes labeled 000 through 111; neighboring nodes differ in exactly one address bit]
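The binary node labels in the figure are what make hypercube routing simple: a node's neighbors are exactly the addresses that differ from it in one bit. A small illustrative sketch (not from the slides):

    #include <stdio.h>

    /* Print the neighbors of 'node' in a d-dimensional hypercube:
       flip each of the d address bits in turn. */
    void hypercube_neighbors(unsigned node, unsigned d) {
        for (unsigned bit = 0; bit < d; bit++)
            printf("neighbor across dimension %u: %u\n", bit, node ^ (1u << bit));
    }

    int main(void) {
        hypercube_neighbors(5, 3);   /* node 101 in the 3-cube: 100, 111, 001 */
        return 0;
    }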


Intel iPSC/1, iPSC/2, iPSC/860
• Shift to general links
  – DMA, enabling non-blocking ops
    ◆ Buffered by system at destination until recv
  – Store & forward routing
• Diminishing role of topology
  – Any-to-any pipelined routing
  – Node-network interface dominates communication time
  – Simplifies programming
  – Allows richer design space


HyperMon Architecture (Who built this?)
• Develop hardware support for tracing
  – Reduces intrusion
  – Trace buffering and I/O
• Hardware design
  – Memory-mapped interface
  – Synchronized timers and automatic timestamping
  – Support for event bursts and off-line streaming


Intel Paragon and ASCI Red
• DARPA project machine
  – Intel i860 processor
  – 2D grid network with processor node attached to every switch
  – 8-bit, 175 MHz bidirectional links
• Forerunner design for ASCI Red
  – First Teraflop computer
[Figure: Intel Paragon node: two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, a memory controller to 4-way interleaved DRAM, and a network interface (NI) with DMA driving the 8-bit, 175 MHz bidirectional links]


Thinking Machines CM-5
• Repackaged SparcStation
  – 4 per board
• Fat-Tree network
• Control network for global synchronization
• Suffered from hardware design and installation problems


Berkeley Network Of Workstations (NOW)
• 100 Sun Ultra2 workstations
• Intelligent network interface
  – proc + mem
• Myrinet network
  – 160 MB/s per link
  – 300 ns per hop


Cray T3E
• Up to 1024 nodes
• 3D torus network
  – 480 MB/s links
• No memory coherence
• Access remote memory
  – Converted to messages
  – SHared MEMory communication
    ◆ put / get operations
• Very successful machine
[Figure: T3E node: processor and cache with memory controller and NI to local memory, attached to a switch with X, Y, Z torus links and external I/O]


IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus
• SP network very advanced
  – Formed from 8-port switches
• Predecessor design to
  – ASCI Blue Pacific (5856 CPUs)
  – ASCI White (8192 CPUs)
[Figure: IBM SP-2 node: Power 2 CPU with L2 $ and memory controller to 4-way interleaved DRAM on the memory bus; the NIC (with an i860 processor and DMA) sits on the MicroChannel I/O bus]


NASA Columbia
• System hardware
  – 20 SGI Altix 3700 superclusters
    ◆ 512 Itanium2 processors (1.5 GHz)
    ◆ 1 TB memory
  – 10,240 processors (now 13,312)
  – NUMAflex architecture
  – NUMAlink “fat tree” network
  – Fully shared memory!!!
• Software
  – Linux with PBS Pro job scheduling
  – Intel Fortran/C/C++ compilers


SGI Altix UV
• Latest generation scalable shared memory architecture
• Scaling from 32 to 2,048 cores
  – Intel Nehalem EX
• Architectural provisioning for up to 262,144 cores
• Up to 16 terabytes of global shared memory in a single system image (SSI)
• High-speed 15 GB per second NUMAlink 5 interconnect


Sandia Red Storm
• System hardware
  – Cray XT3
  – 135 compute node cabinets
  – 12,960 processors
    ◆ AMD Opteron dual-core
  – 320 / 320 service / I/O node processors
  – 40 TB memory
  – 340 TB disk
  – 3D mesh interconnect
• Software
  – Catamount compute node kernel
  – Linux I/O node kernel
  – MPI


LLNL BG/L
• System hardware
  – IBM BG/L (BlueGene)
  – 65,536 dual-processor compute nodes
    ◆ PowerPC processors
    ◆ “Double hummer” floating point
  – I/O node per 32 compute nodes
  – 32x32x64 3D torus network
  – Global reduction tree
  – Global barrier and interrupt networks
  – Scalable tree network for I/O
• Software
  – Compute node kernel (CNK)
  – Linux I/O node kernel (ION)
  – MPI
  – Different operating modes


Tokyo Institute of Technology TSUBAME
• System hardware
  – 655 Sun Fire X4600 servers
  – 11,088 processors
    ◆ AMD Opteron dual-core
  – ClearSpeed accelerator
  – InfiniBand network
  – 21 TB memory
  – 42 Sun Fire X4500 servers
  – 1 PB of storage space
• Software
  – SuSE Linux Enterprise Server 9 SP3
  – Sun N1 Grid Engine 6.0
  – Lustre Client Software


Tokyo Institute of Technology TSUBAME2


TSUBAME2 – Interconnect


Japanese K Computer – Interconnect
• 80,000 CPUs (SPARC64 VIIIfx), 640,000 cores
• 800 racks
• 8.6 Petaflops (Linpack)


Japanese K Computer – Interconnect
[Figure: interconnect diagram; 12 links]


ORNL Titan (http://www.olcf.ornl.gov/titan)
• Cray XK7
  – 18,688 nodes
  – AMD Opteron
    ◆ 16-core Interlagos
    ◆ 299,008 Opteron cores
  – NVIDIA K20x
    ◆ 18,688 GPUs
    ◆ 50,233,344 GPU cores
• Gemini interconnect
  – 3D torus
• 20+ petaflops


Next Class
• Parallel performance models
