
Page 1:

Architecture and Compilers PART II

HPC Fall 2010 Prof. Robert van Engelen

Page 2:

Overview

- The PMS model
- Shared memory multiprocessors
  - Basic shared memory systems
  - SMP, multicore, and COMA
- Distributed memory multicomputers
  - MPP systems
  - Network topologies for message-passing multicomputers
  - Distributed shared memory
- Pipeline and vector processors
- Comparison
- Taxonomies

Page 3:

PMS Architecture Model

- Processor (P): a device that performs operations on data
- Memory (M): a device that stores data
- Switch (S): a device that facilitates transfer of data between devices
- Arcs denote connectivity

[Figure: a simple PMS model of a computer system with CPU and peripherals]

Page 4:

Shared Memory Multiprocessor

- Processors access shared memory via a common switch, e.g. a bus
- Problem: a single bus results in a bottleneck
- Shared memory has a single address space
- The architecture is sometimes referred to as a "dance hall"

Page 5:

Shared Memory: the Bus Contention Problem

- Each processor competes for access to shared memory
  - Fetching instructions
  - Loading and storing data
- Bus contention
  - Access to memory is restricted to one processor at a time
  - This limits the speedup and scalability with respect to the number of processors
- Assume each instruction requires 0 < m < 1 memory operations (the load/store ratio per instruction), F instructions are performed per unit of time, and at most W words can be moved over the bus per unit of time; then

  S_P < W / (m F)

  regardless of the number of processors P
- In other words, the parallel efficiency is limited unless P < W / (m F); a worked instance follows below
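As a worked instance of this bound (the numbers below are hypothetical, chosen only for illustration): suppose the bus moves W = 1 word per cycle, F = 4 instructions are issued per cycle, and m = 0.25 of them are memory operations. Then

$$ S_P < \frac{W}{mF} = \frac{1}{0.25 \times 4} = 1 $$

so the bus alone removes any parallel speedup; this is the W / F << 1 regime that motivates the local caches introduced two slides ahead.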

Page 6:

Shared Memory: Work-Memory Ratio

- Work-memory ratio (FP:M ratio): the ratio of the number of floating-point operations to the number of distinct memory locations referenced in the innermost loop (see the examples below)
  - The same location is counted just once in the innermost loop
  - Assumes effective use of registers (and cache) in the innermost loop for reuse
  - Assumes no reuse across outer loops (register/cache use is saturated in the inner loop)
- Note that FP:M = 1/m − 1, so efficient utilization of shared memory multiprocessors requires P < (FP:M + 1) × W / F

/* 2 distinct memory locations (x and i) and one FP add per
   iteration: FP:M = 1000/2 = 500 */
for (i = 0; i < 1000; i++)
    x = x + i;

/* 2N+1 distinct memory locations (x, a[0..N-1], b[0..N-1]) and
   2N FP operations: FP:M approaches 1 when N is large */
for (i = 0; i < N; i++)
    x = x + a[i]*b[i];

Page 7:

Shared Memory Multiprocessor with Local Cache

- Add a local cache to improve performance when W / F is small
  - With today's systems, W / F << 1
- Problem: how to ensure cache coherence?

Page 8:

Shared Memory: Cache Coherence

- A cache coherence protocol ensures that processors obtain newly altered data when shared data is modified by another processor
- Because caches operate on cache lines, more data than the shared object alone can be affected, which may lead to false sharing (see the sketch below)

[Figure: thread 1 modifies shared data; thread 0 reads the modified shared data; the cache coherence protocol ensures that thread 0 obtains the newly altered data]
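The following is a minimal C/pthreads sketch of false sharing (array name, iteration count, and thread count are illustrative, not from the slides): the two threads never touch each other's counter, yet both counters almost certainly share one cache line, so the coherence protocol keeps bouncing that line between the cores.

#include <pthread.h>
#include <stdio.h>

long counters[2];                 /* adjacent elements: likely one cache line */

void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < 100000000; i++)
        counters[id]++;           /* each thread updates only its own element */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counters[0], counters[1]);
    return 0;
}

Padding each counter to its own cache line (e.g. wrapping it in a struct aligned to 64 bytes) typically removes the slowdown, because the coherence traffic disappears.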

Page 9:

COMA

- Cache-only memory architecture (COMA)
- Large cache per processor to replace shared memory
- A data item is either in one cache (non-shared) or in multiple caches (shared)
- The switch includes an engine that provides a single global address space and ensures cache coherence

Page 10:

Distributed Memory Multicomputer

- Massively parallel processor (MPP) systems with P > 1000
- Communication via message passing
- Nonuniform memory access (NUMA)
- Network topologies
  - Mesh
  - Hypercube
  - Cross-bar switch

Page 11:

Computation-Communication Ratio

- The computation-communication ratio: t_comp / t_comm
- Usually assessed analytically and/or measured empirically
- High communication overhead decreases speedup, so the ratio should be as high as possible
- For example: data size n, number of processors P, and ratio t_comp / t_comm = 1000n / 10n²

S_P = t_s / t_P = 1000n / (1000n / P + 10n²)

[Plots: speedup S_P versus data size n = 10..100 for P = 1, 2, 4, 8, and the computation-communication ratio t_comp / t_comm = 1000n / 10n² versus n]
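A minimal C sketch that tabulates this speedup model (the function and output formatting are mine; the formula is the slide's):

#include <stdio.h>

/* Speedup model from this slide: t_s = 1000n, t_P = 1000n/P + 10n^2 */
double speedup(double n, double P) {
    return 1000.0 * n / (1000.0 * n / P + 10.0 * n * n);
}

int main(void) {
    const int procs[] = {1, 2, 4, 8};
    for (int n = 10; n <= 100; n += 10) {
        printf("n=%3d:", n);
        for (int k = 0; k < 4; k++)
            printf("  P=%d S=%5.2f", procs[k], speedup(n, procs[k]));
        printf("\n");
    }
    return 0;
}

Note how the n² communication term dominates for large n, pulling the speedup back toward (and below) 1 no matter how many processors are used.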

Page 12:

Mesh Topology

- A network of P nodes has mesh size √P × √P
- Diameter 2 × (√P − 1); for example, a 64-node mesh is 8 × 8 with diameter 2 × 7 = 14 hops
- A torus network wraps the ends around, roughly halving the diameter (2 × ⌊√P / 2⌋, i.e. √P − 1 for odd √P)

Page 13:

Hypercube Topology

- A d-dimensional hypercube has P = 2^d nodes
- Diameter is d = log₂ P
- Node addressing is simple: the node numbers of nearest-neighbor nodes differ in exactly one bit
- The routing algorithm flips bits to determine the possible paths; e.g. from node 001 to 111 there are two shortest paths (see the code sketch below):
  - 001 → 011 → 111
  - 001 → 101 → 111

[Figure: hypercubes of dimension d = 2, 3, 4]
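A small C sketch of this bit-flipping routing (the function name and output format are mine): XOR-ing the source and destination addresses marks the dimensions still to be crossed, and flipping the marked bits in any order yields a shortest path.

#include <stdio.h>

/* Walk from src to dst, flipping the lowest differing address bit
   each step; every step moves to a nearest neighbor. */
void route(unsigned src, unsigned dst) {
    unsigned node = src;
    printf("%u", node);
    while (node != dst) {
        unsigned diff = node ^ dst;   /* dimensions still to cross */
        node ^= diff & -diff;         /* flip the lowest set bit */
        printf(" -> %u", node);
    }
    printf("\n");
}

int main(void) {
    route(1, 7);   /* prints 1 -> 3 -> 7, i.e. 001 -> 011 -> 111 */
    return 0;
}

Flipping the bits in descending rather than ascending order would produce the slide's other shortest path, 001 → 101 → 111.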

Page 14:

Cross-bar Switches

- Processors and memories are connected by a set of switches
- Enables simultaneous (contention-free) communication between processor i and memory σ(i), where σ is an arbitrary permutation of 1…P (see the sketch below)

[Figure: cross-bar switch routing the permutation σ(1)=2, σ(2)=1, σ(3)=3]
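A brief C sketch of why "arbitrary permutation" is the operative phrase (the helper name and 1-based module numbering are mine): the crossbar serves all P requests simultaneously exactly when no two processors target the same memory module, i.e. when σ is a bijection.

#include <stdbool.h>
#include <stdio.h>

/* Returns true if no two of the P requests target the same memory. */
bool contention_free(const int sigma[], int P) {
    bool used[64] = { false };            /* assumes P <= 64 */
    for (int i = 0; i < P; i++) {
        if (used[sigma[i]]) return false; /* collision on a memory module */
        used[sigma[i]] = true;
    }
    return true;
}

int main(void) {
    int sigma[] = {2, 1, 3};   /* the slide's sigma(1)=2, sigma(2)=1, sigma(3)=3 */
    printf("%s\n", contention_free(sigma, 3) ? "contention free" : "conflict");
    return 0;
}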

Page 15:

Multistage Interconnect Network

- Each switch has an upper output (0) and a lower output (1)
- A message travels through a switch based on the destination address: each bit of the destination address controls one switch on the way from start to destination
- For example, from 001 to 100 (see the sketch below):
  - The first switch selects the lower output (1)
  - The second switch selects the upper output (0)
  - The third switch selects the upper output (0)
- Contention can occur when two messages are routed through the same switch

[Figures: 4×4 two-stage and 8×8 three-stage interconnects]
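The slide's example as a short C sketch of this destination-tag routing (the function is illustrative): scanning the destination address from the most significant bit down gives the output port chosen at each stage.

#include <stdio.h>

/* The i-th most significant destination bit selects the output at
   stage i: 0 = upper, 1 = lower. */
void route(unsigned dest, int stages) {
    for (int s = stages - 1; s >= 0; s--)
        printf("stage %d: %s output\n", stages - s,
               (dest >> s) & 1 ? "lower" : "upper");
}

int main(void) {
    route(4, 3);   /* destination 100: lower (1), upper (0), upper (0) */
    return 0;
}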

Page 16:

Distributed Shared Memory

- Distributed shared memory (DSM) systems use physically distributed memory modules and a global address space that gives the illusion of shared virtual memory
- Hardware is used to automatically translate a memory address into a local address or a remote memory address (via message passing)
- Software approaches add a programming layer to simplify access to shared objects (hiding the communication)

Page 17:

Pipeline and Vector Processors

- Vector processors run operations on multiple data elements simultaneously
- A vector processor has a maximum vector length, e.g. 512
- Strip mining the loop produces an outer loop with stride 512, enabling vectorization of longer vector operations (see the loops below)
- Pipelined vector architectures dispatch multiple vector operations per clock cycle
- Vector chaining allows the result of a previous vector operation to be fed directly into the next operation in the pipeline

! Original loop (10000 iterations)
DO i = 0,9999
  z(i) = x(i) + y(i)
ENDDO

! Strip-mined with vector length 512: 19 full strips plus a
! 272-element remainder (19 * 512 = 9728)
DO j = 0,9216,512
  DO i = 0,511
    z(j+i) = x(j+i) + y(j+i)
  ENDDO
ENDDO
DO i = 0,271
  z(9728+i) = x(9728+i) + y(9728+i)
ENDDO

! The same strip-mined loop in array syntax
DO j = 0,9216,512
  z(j:j+511) = x(j:j+511) + y(j:j+511)
ENDDO
z(9728:9999) = x(9728:9999) + y(9728:9999)

Page 18:

Comparison: Bandwidth, Latency and Capacity

Page 19:

Flynn’s Taxonomy

- Single instruction stream, single data stream (SISD)
  - Traditional PC system
- Single instruction stream, multiple data stream (SIMD)
  - Similar to MMX/SSE/AltiVec multimedia instruction sets
  - MASPAR
- Multiple instruction stream, multiple data stream (MIMD)
  - Single program, multiple data (SPMD) programming: each processor executes a copy of the program

[Figure: taxonomy grid]

                         Data stream
                         single    multiple
  Instruction   single   SISD      SIMD
  stream        multiple           MIMD

Page 20:

Task Parallelism versus Data Parallelism

- Task parallelism, MIMD
  - Fork-join model with thread-level parallelism and shared memory
  - Message passing model with (distributed processing) processes
- Data parallelism, SIMD
  - Multiple processors (or units) operate on a segmented data set
  - SIMD model with vector and pipeline machines
  - SIMD-like multimedia extensions, e.g. MMX/SSE/AltiVec

[Figure: vector operation X[0:3] ⊕ Y[0:3] with an SSE instruction on the Pentium 4: src1 = X3 X2 X1 X0, src2 = Y3 Y2 Y1 Y0, dest = X3⊕Y3 X2⊕Y2 X1⊕Y1 X0⊕Y0]
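The figure's 4-wide operation written as a C sketch with SSE intrinsics (taking ⊕ to be floating-point addition; the values are arbitrary):

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 x = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);    /* X3..X0 */
    __m128 y = _mm_set_ps(30.0f, 20.0f, 10.0f, 0.0f); /* Y3..Y0 */
    __m128 d = _mm_add_ps(x, y);     /* one instruction performs 4 adds */
    float out[4];
    _mm_storeu_ps(out, d);           /* out[i] = X_i + Y_i */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}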

Page 21:

Further Reading

- [PP2] pages 13-26
- [SPC] pages 71-95
- [HPC] pages 25-28
- Optional: [SRC] pages 15-42