
Page 1:

Multiprocessor Systems

Page 2:

Conventional Wisdom (CW) in Computer Architecture

• Old CW: Power is free, transistors expensive
• New CW: “Power wall”: power expensive, transistors free (can put more on chip than can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: “Memory wall”: memory slow, multiplies fast (200 clocks to DRAM memory, 4 clocks for FP multiply)
• Old CW: Increasing Instruction-Level Parallelism via compilers and innovation (out-of-order, speculation, VLIW, …)
• New CW: “ILP wall”: diminishing returns on more ILP
• New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Uniprocessor performance only 2X / 5 yrs?

Page 3:

Uniprocessor Performance (SPECint)

[Figure: SPECint performance relative to the VAX-11/780 (log scale, 1 to 10,000) plotted against year, 1978 to 2006, with growth annotated at 25%/year, 52%/year, and ??%/year, and a 3X gap marked at the right of the plot.]

• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Sea change in chip design: multiple “cores” or processors per chip

Page 4:

Evolution from Single Core to Multi-Core

“… today’s processors … are nearing an impasse as technologies approach the speed of light.” David Mitchell, The Transputer: The Time Is Now (1989)

• Procrastination rewarded: 2X sequential performance / 1.5 years

“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” Paul Otellini, President, Intel (2005)

• All microprocessor companies switch to MP (2X CPUs / 2 yrs)
• Procrastination penalized: 2X sequential performance / 5 yrs

Manufacturer/Year    AMD/’05   Intel/’06   IBM/’04   Sun/’05
Processors/chip          2          2          2         8
Threads/Processor        1          2          2         4
Threads/chip             2          4          4        32

Procrastination: to keep delaying something that must be done

Page 5:

Flynn’s Taxonomy

• Flynn classified machines by their data and control streams in 1966
• SIMD: Data-Level Parallelism
• MIMD: Thread-Level Parallelism
• MIMD popular because
  • Flexible: N programs and 1 multithreaded program
  • Cost-effective: same MPU in desktop and MIMD

Single Instruction, Single Data (SISD): uniprocessor
Single Instruction, Multiple Data (SIMD): single PC; Vector, CM-2
Multiple Instruction, Single Data (MISD): ????
Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers

M.J. Flynn, “Very High-Speed Computers”, Proc. of the IEEE, Vol. 54, pp. 1900-1909, Dec. 1966.

Page 6:

Flynn’s Taxonomy

• SISD (Single Instruction, Single Data)
  • Uniprocessors
• MISD (Multiple Instruction, Single Data)
  • Single data stream operated on by successive functional units
• SIMD (Single Instruction, Multiple Data)
  • One instruction stream executed by multiple processors on different data
  • Simple programming model, low overhead
  • Examples: Connection Machine and vector processors
• MIMD (Multiple Instruction, Multiple Data) is the most general
  • Flexibility for parallel applications and multi-programmed systems
  • Processors can run independent processes or applications
  • Processors can also run threads belonging to one parallel application
  • Uses off-the-shelf microprocessors and components

Page 7:

Major MIMD Styles

• Symmetric Multiprocessors (SMP)
  • Main memory is shared and equally accessible by all processors
  • Also called Uniform Memory Access (UMA)
  • Bus-based or interconnection-network-based
• Distributed-memory multiprocessors
  • Distributed Shared Memory (DSM) multiprocessors
    • Distributed memories are shared and can be accessed by all processors
    • Non-Uniform Memory Access (NUMA)
    • Latency varies between local and remote memory access
  • Message-passing multiprocessors, multicomputers, and clusters
    • Distributed memories are NOT shared
    • Each processor can access only its own local memory
    • Processors communicate by sending and receiving messages

Page 8:

Shared Memory Architecture

• Any processor can directly reference any physical memory
  • Any I/O controller can reference any physical memory
  • Operating system can run on any processor
    • OS uses shared memory to coordinate
• Communication occurs implicitly as a result of loads and stores
• Wide range of scale
  • Few to hundreds of processors
  • Memory may be physically distributed among processors
• History dates to early 1960s

[Figure: several processors and I/O controllers all connected to a shared physical memory.]

Page 9:

Shared Memory Organizations

[Figure: four shared-memory organizations. Dance Hall (UMA): processors P1..Pn, each with a cache, on one side of an interconnection network, with the memory modules on the other side. Distributed Shared Memory (NUMA): a memory module attached to each processor/cache node, with the nodes joined by an interconnection network. Bus-based Shared Memory: processors with caches, main memory, and I/O devices sharing a common bus. Shared Cache: processors connected through a switch to an interleaved shared cache and interleaved main memory.]

Page 10:

Bus-Based Symmetric Multiprocessors

• Symmetric access to main memory from any processor
• Dominate the server market
• Building blocks for larger systems
• Attractive as throughput servers and for parallel programs
• Uniform access via loads/stores
• Automatic data movement and coherent replication in caches
• Cheap and powerful extension to uniprocessors
• Key is extension of the memory hierarchy to support multiple processors

[Figure: processors P1..Pn, each with a multilevel cache, connected by a bus to main memory and the I/O system.]

Page 11:

Shared Address Space Programming Model

• A process is a virtual address space
  • With one or more threads of control
• Part of the virtual address space is shared by processes
  • Multiple threads share the address space of a single process
• All communication is through shared memory
  • Achieved by loads and stores
  • Writes by one process/thread are visible to others
  • Special atomic operations for synchronization
  • OS uses shared memory to coordinate processes

[Figure: virtual address spaces for a collection of processes communicating via shared addresses. Each process P0..Pn has a private portion and a shared portion of its address space; the shared portions map to common physical addresses in the machine physical address space, so a store by one process can be observed by a load in another.]
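
A minimal sketch of this model in C with POSIX threads (the variable names, the flag-based hand-off, and the use of a mutex as the atomic synchronization operation are illustrative assumptions, not from the slides): one thread communicates a value to another purely through stores and loads on shared memory.

#include <pthread.h>
#include <stdio.h>

int shared_value = 0;                               /* lives in the shared portion of the address space */
int ready = 0;                                      /* flag set when the value is valid */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* atomic operation used for synchronization */

void *producer(void *arg) {
    pthread_mutex_lock(&lock);
    shared_value = 42;                              /* ordinary store into shared memory */
    ready = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

void *consumer(void *arg) {
    int v = 0, done = 0;
    while (!done) {
        pthread_mutex_lock(&lock);
        if (ready) { v = shared_value; done = 1; }  /* ordinary load from shared memory */
        pthread_mutex_unlock(&lock);
    }
    printf("consumer read %d\n", v);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Compiled with cc -pthread, the two threads communicate implicitly through the shared variables, while the mutex plays the role of the special atomic operation mentioned above.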

Page 12:

Medium and Large Scale Multiprocessors

• Problem is the interconnect: high cost (crossbar) or limited bandwidth (bus)
• Centralized memory, or uniform memory access (UMA)
  • Latencies to memory are uniform, but uniformly large
  • Interconnection network: crossbar or multi-stage
• Distributed shared memory, or non-uniform memory access (NUMA)
  • Distributed memory forms one global shared physical address space
  • Access to local memory is faster than access to remote memory

[Figure: two organizations. Distributed Shared Memory: a memory module attached to each processor/cache node, with the nodes joined by an interconnection network. Centralized Memory: processor/cache nodes and the memory modules on opposite sides of an interconnection network.]

Page 13:

Message Passing Architectures

• Complete computer as a building block
  • Includes processor, memory, and I/O system
  • Easier to build and scale than shared memory architectures
• Communication via explicit I/O operations
  • Communication integrated at the I/O level, not into the memory system
• Much in common with networks of workstations or clusters
  • However, tight integration between processor and network
  • Network is of higher capability than a local area network
• Programming model
  • Direct access only to local memory (private address space)
  • Communication via explicit send/receive messages (library or system calls)

[Figure: nodes, each containing a processor, cache, and memory (P, $, M), connected by an interconnection network.]

Page 14:

Message-Passing Abstraction

• Send specifies the receiving process and the buffer to be transmitted
• Receive specifies the sending process and the buffer to receive into
• Optional tag on send and matching rule on receive
  • Matching rule: match a specific tag t, or any tag, or any process
• Combination of a send and a matching receive achieves
  • A pairwise synchronization event
  • A memory-to-memory copy of the message
• Overheads: copying, buffer management, protection
• Examples: MPI and PVM message passing libraries

[Figure: process P executes Send Q, X, t to transmit the buffer at address X in its local address space; process Q executes Receive P, Y, t, and the matching message is copied into address Y in Q’s local address space.]
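
As a concrete sketch of this abstraction in MPI (the buffer contents, tag value, and ranks are illustrative), process 0 plays the role of P sending its buffer X with tag t, and process 1 plays the role of Q receiving into Y with a matching tag:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, tag = 7;                  /* the tag t used for matching */
    double X = 3.14, Y = 0.0;           /* buffers in each process's local address space */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* "Send Q, X, t": send X to process 1 with tag t */
        MPI_Send(&X, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* "Receive P, Y, t": receive from process 0 with tag t into Y */
        MPI_Recv(&Y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Q received %f\n", Y);
    }
    MPI_Finalize();
    return 0;
}

Passing MPI_ANY_SOURCE or MPI_ANY_TAG to MPI_Recv gives the looser matching rules (any process, any tag) mentioned above.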

Page 15:

Variants of Send and Receive

• Parallel programs using send and receive are quite structured
  • Most often, all nodes execute identical copies of a program
  • Processes can name each other using a simple linear ordering
• Blocking send:
  • Sender sends a request and waits until the reply is returned
• Non-blocking send:
  • Sender sends a message and continues without waiting for a reply
• Blocking receive:
  • Receiver blocks if it tries to receive a message that has not yet arrived
• Non-blocking receive:
  • Receiver simply posts a receive without waiting for the sender
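
A short MPI sketch of these variants (buffer size and ranks are illustrative): MPI_Isend is a non-blocking send that returns immediately and is completed later with MPI_Wait, while MPI_Recv is a blocking receive that returns only after a matching message has arrived.

#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double buf[1024] = {0};                        /* message buffer */
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* non-blocking send: returns at once, so work can overlap the transfer */
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other computation could go here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);         /* now it is safe to reuse buf */
    } else if (rank == 1) {
        /* blocking receive: blocks until the message has arrived */
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

The blocking MPI_Send and the non-blocking MPI_Irecv (also completed with MPI_Wait) fill in the other two variants.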

Page 16:

Evolution of Message-Passing Machines

• Early machines: FIFO on each link
  • Hardware close to the programming model
  • Synchronous send/receive operations
  • Topology central (hypercube algorithms)

[Figure: a 3-dimensional hypercube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111.]

CalTech Cosmic Cube (Seitz)

Page 17:

Example: Intel Paragon

[Figure: an Intel Paragon node: two i860 processors, each with an L1 cache, plus a network interface (NI) with a DMA engine and driver, and a memory controller with 4-way interleaved DRAM, all on a 64-bit, 50 MHz memory bus; links are 8 bits wide, 175 MHz, bidirectional; nodes attach to every switch of a 2D grid network. Photo: Sandia’s Intel Paragon XP/S-based supercomputer.]

Each card is an SMP with two or more i860 processors and a network interface chip connected to the cache-coherent memory bus.

Each node has a DMA engine to transfer contiguous chunks of data to and from the network at a high rate.

Page 18:

Example 1: Matrix Addition (1/3)

Assume an MIMD shared-memory multiprocessor (SMP) with P processors. The objective is to write a parallel program for matrix addition C = A + B, where A, B, and C are all NxN matrices, using P processors.

(1) How should the array of results (C) be partitioned for parallel processing among P processes?

A few approaches: (1) block-row partitioning, where each process is assigned N/P rows and N columns, or (2) block-column partitioning, where each process is assigned N rows and N/P columns. The first is preferred because of row-major storage.

(2) Find the number of results assigned to each process for load balancing.

There are N^2 results to compute in array C, so each process should get N^2/P of them for load balancing. Using block-row partitioning, each process is assigned a block-row of (N/P)*N results.

(3) Determine the range of results that must be computed by each process as a function of the process ID (Pid).

Using block-row partitioning, each process is assigned N/P rows and N columns. This can be achieved by partitioning A into block-rows such that each process gets the range (Imin, Imax), with Imin = Pid*(N/P) and Imax = Imin + N/P - 1; each process plugs its own Pid into this formula to find its range, i.e. a parallel SPMD program.
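
A minimal C sketch of the range computation (the function name block_row_range and the example values in main are illustrative; N is assumed to be divisible by P, as the slides implicitly assume):

#include <stdio.h>

/* Block-row range (inclusive) owned by process Pid, 0 <= Pid < P. */
void block_row_range(int N, int P, int Pid, int *Imin, int *Imax) {
    int rows_per_proc = N / P;            /* N/P rows per process */
    *Imin = Pid * rows_per_proc;          /* first row this process computes */
    *Imax = *Imin + rows_per_proc - 1;    /* last row this process computes */
}

int main(void) {
    int Imin, Imax;
    block_row_range(8, 4, 2, &Imin, &Imax);                    /* e.g. N = 8, P = 4, Pid = 2 */
    printf("process 2 computes rows %d..%d\n", Imin, Imax);    /* prints: rows 4..5 */
    return 0;
}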

Page 19:

Example 1: Matrix Addition (2/3)

(4) Using a shared-memory multiprocessor, write an SPMD program to compute the matrix addition with P processes (threads), assuming that all arrays are declared as shared data.

The SPMD program that runs on each processor is:

Imin = Pid*N/P;  Imax = Imin + N/P - 1;
for (i = Imin; i <= Imax; i++)
    for (j = 0; j < N; j++)
        c[i][j] = a[i][j] + b[i][j];

Note: the range of computed data for each process depends on the Pid.
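
A runnable version of this SPMD program using POSIX threads (the thread-creation scaffolding, the fixed N and P values, and the function name add_block_rows are assumptions added to make the sketch self-contained):

#include <pthread.h>

#define N 8                                /* matrix dimension (example value) */
#define P 4                                /* number of threads/processes */

double a[N][N], b[N][N], c[N][N];          /* shared arrays */

void *add_block_rows(void *arg) {
    int Pid = (int)(long)arg;              /* this thread's process id */
    int Imin = Pid * N / P;                /* first row in this thread's range */
    int Imax = Imin + N / P - 1;           /* last row in this thread's range */
    for (int i = Imin; i <= Imax; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
    return NULL;
}

int main(void) {
    pthread_t tid[P];
    for (long p = 0; p < P; p++)           /* one thread per process id */
        pthread_create(&tid[p], NULL, add_block_rows, (void *)p);
    for (int p = 0; p < P; p++)
        pthread_join(tid[p], NULL);
    return 0;
}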

Page 20:

Example 1: Matrix Addition (3/3)

(5) Using a distributed-memory multiprocessor, write an SPMD program using the MPI library to parallelize C = A + B, where each of A, B, and C is an NxN matrix, using P processors. Assume block-row partitioning, each process working only on its own range of data, and the results returning to the master processor (process 0) after all the iterations complete.

#include <mpi.h>

#define N 512                                      /* matrix dimension (example value) */

double a[N][N], b[N][N], c[N][N];                  /* full matrices, used on process 0 */
double la[N][N], lb[N][N], lc[N][N];               /* local block-rows (only the first N/P rows are used) */

int main(int argc, char **argv) {
    int numprocs, mypid;
    MPI_Init(&argc, &argv);                        /* Init MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);      /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);         /* get process id as mypid */
    int my_range = N / numprocs;                   /* number of rows processed by each process */

    /* Process 0 scatters block-rows of A and B over all the processes */
    MPI_Scatter(a, my_range * N, MPI_DOUBLE, la, my_range * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, my_range * N, MPI_DOUBLE, lb, my_range * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < my_range; i++)             /* each process adds its own block-row */
        for (int j = 0; j < N; j++)
            lc[i][j] = la[i][j] + lb[i][j];

    /* Gather the block-rows of C back to process 0 */
    MPI_Gather(lc, my_range * N, MPI_DOUBLE, c, my_range * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Page 21:

Example 2: Summing 100,000 Numbers on 100 Processors

Each processor starts by running a loop that sums its subset of the numbers in vector A (vector A and array sum are shared variables, Pid is the processor’s id, i is a private variable):

sum[Pid] = 0;
for (i = 1000*Pid; i < 1000*(Pid+1); i = i + 1)
    sum[Pid] = sum[Pid] + A[i];

The processors then coordinate in adding together the partial sums (P is a private variable initialized to 100, the number of processors):

repeat
    synch();                              /* synchronize first */
    if (P % 2 != 0 && Pid == 0)
        sum[0] = sum[0] + sum[P-1];
    P = P/2;
    if (Pid < P) sum[Pid] = sum[Pid] + sum[Pid+P];
until (P == 1);                           /* final sum in sum[0] */
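
A runnable sketch of the same computation with POSIX threads, where pthread_barrier_wait stands in for the slide’s synch() (the thread scaffolding, the barrier choice, and the all-ones test data are assumptions; the slide leaves synch() abstract):

#include <pthread.h>
#include <stdio.h>

#define NPROC 100                          /* number of "processors" (threads) */
#define TOTAL 100000                       /* numbers to sum */

double A[TOTAL];                           /* shared input vector */
double sum[NPROC];                         /* shared partial sums */
pthread_barrier_t barrier;                 /* implements synch() */

void *worker(void *arg) {
    int Pid = (int)(long)arg;
    sum[Pid] = 0;
    for (int i = 1000 * Pid; i < 1000 * (Pid + 1); i++)
        sum[Pid] = sum[Pid] + A[i];        /* sum this processor's subset */
    int P = NPROC;                         /* private copy, as on the slide */
    do {
        pthread_barrier_wait(&barrier);    /* synch(): wait for all partial sums */
        if (P % 2 != 0 && Pid == 0)
            sum[0] = sum[0] + sum[P - 1];
        P = P / 2;
        if (Pid < P)
            sum[Pid] = sum[Pid] + sum[Pid + P];
    } while (P != 1);
    return NULL;
}

int main(void) {
    for (int i = 0; i < TOTAL; i++) A[i] = 1.0;    /* test data: the total should be 100000 */
    pthread_barrier_init(&barrier, NULL, NPROC);
    pthread_t tid[NPROC];
    for (long p = 0; p < NPROC; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < NPROC; p++)
        pthread_join(tid[p], NULL);
    printf("total = %.0f\n", sum[0]);              /* final sum is in sum[0] */
    pthread_barrier_destroy(&barrier);
    return 0;
}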

Page 22:

An Example with 10 Processors

[Figure: the reduction tree for 10 processors. With P = 10, processors P0..P9 each hold a partial sum sum[0]..sum[9]; after the first halving step (P = 5) the partial sums live on P0..P4; further steps reduce to P = 2 and finally P = 1, leaving the total in sum[0] on P0.]

Page 23:

Message Passing Multiprocessors

• Each processor has its own private address space
• Q1 – Processors share data by explicitly sending and receiving information (messages)
• Q2 – Coordination is built into the message passing primitives (send and receive)