
Parallel Computing

Architecture

Interconnection Networks

• Uses of interconnection networks
  • Connect processors to shared memory
  • Connect processors to each other
• Interconnection media types
  • Shared medium
  • Switched medium


Parallel Computers

• Vector Computers
  • Instructions include direct vector operations
  • Pipelined – data streams through vector arithmetic units (CRAY)
  • Processor array – processors execute the same instruction
• Multiple CPUs
  • Multiprocessors – multiple CPUs with shared memory
  • Multicomputers – multiple CPUs with distributed memory

Processor Array

• Only well adapted to data-parallel problems
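A data-parallel problem applies the same operation independently to every element of a data set, which is what a processor array does in lockstep. The loop below is a minimal illustrative sketch (not from the slides): an element-wise vector addition in C with an optional OpenMP simd hint; the function and array names are invented for the example.

    /* Element-wise vector addition: the same "add" is applied to every
     * element, so a processor array (or any SIMD unit) can execute all
     * iterations in lockstep.  The pragma merely asks a compiler that
     * supports OpenMP to vectorize the loop; it is ignored otherwise. */
    void vector_add(const float *a, const float *b, float *c, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }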

Multiprocessors

• Shared memory
• Can be built with commodity components
• Centralized
  • Extension of a uniprocessor: add CPUs to a bus
  • Same memory access time: UMA (uniform memory access)
  • Also known as SMP (symmetric multiprocessor)
• Distributed
  • Memory distributed among the processors
  • NUMA (non-uniform memory access)
  • Allows greater numbers of processors

Centralized multiprocessors

• Problem: cache coherence
• Most common solution: the write-invalidate protocol

Write Invalidate Protocol

The most common solution to cache coherence:

1. Each CPU’s cache controller monitors (snoops) the bus and identifies which cache blocks are requested by other CPUs.
2. A processor gains exclusive control of a data item before performing a write.
3. Before the write occurs, all other copies of the data item cached by other processors are invalidated.
4. When any other CPU then tries to read a memory location from an invalidated cache block, a cache miss occurs and it has to retrieve the updated data from memory.

Cache-coherence

[Diagram sequence: CPUs A and B, each with a private cache, share a bus to memory; X initially holds 7 in memory.]

• Reading from memory is not a problem: CPU A reads X and caches the value 7; CPU B then reads X and also caches 7.
• Writing to memory is a problem: if CPU B simply wrote X = 2, memory and CPU B’s cache would hold 2 while CPU A’s cache still held the stale value 7.
• With write-invalidate snooping, a cache control monitor in each CPU snoops the bus to see which cache block is being requested by other processors.
• CPU B signals its intent to write X; before the write can occur, all copies of the data at that address (here, CPU A’s) are declared invalid.
• CPU B then writes X = 2. When CPU A (or any other processor) later tries to read this location from its cache, it receives a cache miss and has to refresh the value from main memory.
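The bus-snooping sequence above can be mimicked in a few lines of C. This is a simplified, illustrative simulation for a single cache block (the three-state encoding, the write-through behaviour and all names are assumptions made for the sketch, not the slides' exact protocol): every write first invalidates the other CPU's copy, and a later read by that CPU misses and refetches from memory.

    #include <stdio.h>

    #define NCPUS 2

    /* Simplified per-block cache states (an MSI-like assumption). */
    typedef enum { INVALID, SHARED, MODIFIED } State;

    static int   memory = 7;                        /* main-memory copy of X  */
    static State state[NCPUS] = { INVALID, INVALID };
    static int   cached[NCPUS];                     /* each CPU's cached copy */

    /* Read: on a miss (INVALID block) refetch the value from memory. */
    static int cpu_read(int cpu)
    {
        if (state[cpu] == INVALID) {                /* cache miss             */
            cached[cpu] = memory;
            state[cpu]  = SHARED;
        }
        return cached[cpu];
    }

    /* Write: the "intent to write" is snooped by every other cache,
     * which invalidates its copy before the write takes place. */
    static void cpu_write(int cpu, int value)
    {
        for (int other = 0; other < NCPUS; other++)
            if (other != cpu)
                state[other] = INVALID;             /* snooped invalidation   */

        cached[cpu] = value;
        state[cpu]  = MODIFIED;
        memory      = value;                        /* write-through, as drawn */
    }

    int main(void)
    {
        printf("A reads X: %d\n", cpu_read(0));     /* 7                          */
        printf("B reads X: %d\n", cpu_read(1));     /* 7                          */
        cpu_write(1, 2);                            /* B writes 2, A invalidated  */
        printf("A reads X: %d\n", cpu_read(0));     /* miss: refetches 2          */
        return 0;
    }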

Distributed Multiprocessors

• Increase local memory bandwidth and lower the average memory access time
• All the memory forms a single address space

Cache Coherence

• Implementation is more difficult
  • No shared memory bus to “snoop”
  • A directory-based protocol is needed
• Some NUMA multiprocessors do not support it in hardware
  • Only instructions and private data are cached
  • Large memory access time variance

Directory-based Protocol

• A distributed directory contains information about cacheable memory blocks
• One directory entry for each cache block
• Each entry has (see the sketch after this list)
  • the sharing status
  • which processors have copies
• Sharing status
  • Uncached (denoted by “U”) – block not in any processor’s cache
  • Shared (denoted by “S”) – cached by one or more processors, read only
  • Exclusive (denoted by “E”) – cached by exactly one processor, with write access
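As a concrete illustration, one way to hold such an entry in memory is a status field plus a presence bit vector with one bit per processor. This is only a hypothetical sketch; the type and field names below are assumptions, not part of the slides.

    #include <stdint.h>

    /* Sharing status of a cacheable block, as listed above. */
    typedef enum {
        UNCACHED,    /* U: block not in any processor's cache          */
        SHARED,      /* S: cached by one or more processors, read only */
        EXCLUSIVE    /* E: cached by exactly one processor, writable   */
    } SharingStatus;

    /* One directory entry per cache block: the sharing status plus a
     * bit vector with one presence bit per processor (up to 32 here). */
    typedef struct {
        SharingStatus status;
        uint32_t      presence;   /* bit i set => CPU i holds a copy */
    } DirEntry;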

Directory-based Protocol

[Diagram: CPU 0, CPU 1 and CPU 2 connected by an interconnection network; each node has its own cache, local memory and a portion of the directory.]

Directory-based Protocol: Example

[Diagram sequence: CPU 0, CPU 1 and CPU 2 on an interconnection network. X lives in the home node’s memory with initial value 7; the directory entry for X holds the sharing status and a bit vector with one bit per CPU, initially “X U 0 0 0”.]

• CPU 0 reads X – the read miss goes to X’s home directory; the directory supplies the value 7, CPU 0 caches X = 7, and the entry becomes S with bit vector 1 0 0.
• CPU 2 reads X – another read miss; the directory supplies 7, CPU 2 caches X = 7, and the entry stays S with bit vector 1 0 1.
• CPU 0 writes 6 to X – the write miss reaches the directory, which invalidates CPU 2’s copy; the entry becomes E with bit vector 1 0 0, and CPU 0’s cache holds X = 6 while memory still holds 7.
• CPU 1 reads X – read miss; since CPU 0 holds the block exclusively, the directory obtains the current value from CPU 0 and updates memory to 6; the entry switches to shared (S, 1 1 0), and CPU 0 and CPU 1 both cache X = 6.
• CPU 2 writes 5 to X – write miss; the directory invalidates the copies at CPU 0 and CPU 1; the entry becomes E with bit vector 0 0 1 and CPU 2’s cache holds X = 5.
• CPU 0 writes 4 to X – write miss; the directory makes CPU 2 write its copy back (memory becomes 5) and invalidates it (momentarily U, 0 0 0), creates cache block storage for X at CPU 0, and grants CPU 0 exclusive access (E, 1 0 0); CPU 0 writes X = 4 into its cache.
• CPU 0 writes back (flushes) cache block X – the data write back updates memory to 4 and the directory entry returns to U with bit vector 0 0 0.
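The directory-side behaviour in this walkthrough can be summarised as two handlers, one for read misses and one for write misses. The sketch below is only an outline under stated assumptions: it reuses the hypothetical DirEntry and SharingStatus definitions from the earlier sketch, assumes at most 32 CPUs, and leaves the actual data transfers and invalidation messages as comments.

    /* Directory-side handling of a read miss from CPU cpu.
     * (Assumes the DirEntry and SharingStatus definitions sketched earlier;
     * data movement and acknowledgements are omitted.) */
    static void on_read_miss(DirEntry *e, int cpu)
    {
        if (e->status == EXCLUSIVE) {
            /* The current owner must supply the block and write it back
             * to memory before it can be shared again ("switch to shared"). */
        }
        e->presence |= (uint32_t)1 << cpu;   /* record the new sharer     */
        e->status    = SHARED;               /* U or E -> S               */
    }

    /* Directory-side handling of a write miss (or upgrade) from CPU cpu. */
    static void on_write_miss(DirEntry *e, int cpu)
    {
        for (int i = 0; i < 32; i++) {
            if (i != cpu && (e->presence & ((uint32_t)1 << i))) {
                /* send an invalidate (and, for an exclusive owner, a
                 * write-back request) to CPU i -- message omitted */
            }
        }
        e->presence = (uint32_t)1 << cpu;    /* only the writer remains   */
        e->status   = EXCLUSIVE;             /* writer gets write access  */
    }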

Multicomputer

• Distributed-memory multiple-CPU computer
• The same address on different processors refers to different physical memory locations
• Processors interact through message passing (see the sketch after this list)
• Flavors
  • Asymmetrical
  • Symmetrical
  • Mixed
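Because each processor has its own address space, data can only move between processors as explicit messages. The slides do not name a particular library; the following is a minimal sketch using MPI, a common message-passing interface, in which the variable x on process 0 and on process 1 refers to different physical memory.

    #include <mpi.h>
    #include <stdio.h>

    /* Process 0 sends one integer to process 1.  The same variable name
     * on the two processes denotes different physical memory, so the
     * value must be transferred explicitly.
     * Build with mpicc, run with: mpirun -np 2 ./a.out */
    int main(int argc, char **argv)
    {
        int rank, x = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received x = %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }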


Asymmetrical Multicomputer

• Back-end dedicated to parallel operations
• A single front-end computer can limit the scalability of the system
• Every application requires development of both a front-end and a back-end program

Symmetrical Multicomputer

• Every processor executes the same program
• No simple way to balance the program development workload among processors
• More difficult to achieve high performance with several processes on each processor

Mixed Cluster Multicomputer

• Co-located computers

• Dedicated to running parallel jobs

• Identical operating system

• Identical local disk images


Flynn’s Taxonomy

• Instruction stream
• Data stream
• Single vs. multiple
• Four combinations
  • SISD
  • SIMD
  • MISD
  • MIMD

Flynn’s Taxonomy

• SISD
  • Single Instruction, Single Data
  • Single-CPU systems (note: co-processors don’t count)
  • Can execute multiple functions
  • Multiple I/O
  • Example: PCs
• SIMD
  • Single Instruction, Multiple Data
  • Two architectures fit this category
    • Pipelined vector processor
    • Processor array

Flynn’s Taxonomy

• MISD
  • Multiple Instruction, Single Data
  • Example: systolic array
• MIMD
  • Multiple Instruction, Multiple Data
  • Multiple-CPU computers
    • Multiprocessors
    • Multicomputers

Systolic Array

• Multiple interconnected processing elements
• Example: a sorting element

[Diagram: a sorting element takes 3 inputs a, b, c during an input phase (1 clock) and emits 3 outputs, min(a, b, c), med(a, b, c) and max(a, b, c), during an output phase (1 clock).]
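In software terms, each element simply sorts its three inputs. The function below is an illustrative sketch (real systolic elements are hardware, and the name sort3 is invented here); it produces the minimum, median and maximum with three compare-and-swap steps.

    /* One sorting element: given a, b, c, produce min, median and max.
     * Three compare-and-swap steps form a 3-input sorting network. */
    static void sort3(int a, int b, int c, int *min, int *med, int *max)
    {
        int lo = a, mid = b, hi = c, t;

        if (lo > mid) { t = lo;  lo  = mid; mid = t; }
        if (mid > hi) { t = mid; mid = hi;  hi  = t; }
        if (lo > mid) { t = lo;  lo  = mid; mid = t; }

        *min = lo;  *med = mid;  *max = hi;
    }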

A priority queue in a systolic array

• One insertion, two extractions

[Diagram sequence: a chain of sorting elements initially holding 4, 5 and 8, padded with ∞ (a −∞ value is fed at the input). Inserting 7 places it in order among 4, 5, 7, 8; each extraction then removes the current minimum (first 4, then 5) while the remaining values move toward the front of the array.]
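The same behaviour can be mimicked in software with an array of cells: on insertion every cell keeps the smaller of its stored value and the incoming key and passes the larger one to the right; on extraction the minimum leaves from the front and the rest shift forward. This is only an illustrative analogue of the systolic queue (the names, the capacity of 8 cells and the use of INT_MAX as "infinity" are assumptions), not a hardware description.

    #include <limits.h>
    #include <stdio.h>

    #define CELLS 8        /* queue capacity; INT_MAX plays the role of "infinity" */

    static int cell[CELLS];

    static void pq_init(void)
    {
        for (int i = 0; i < CELLS; i++)
            cell[i] = INT_MAX;
    }

    /* Insertion: the new key enters at the left; each cell keeps the
     * smaller of (incoming, stored) and passes the larger to the right. */
    static void pq_insert(int key)
    {
        for (int i = 0; i < CELLS; i++) {
            if (key < cell[i]) { int t = cell[i]; cell[i] = key; key = t; }
        }
    }

    /* Extraction: the minimum leaves from the front; the remaining
     * values shift one cell toward the front. */
    static int pq_extract(void)
    {
        int min = cell[0];
        for (int i = 0; i < CELLS - 1; i++)
            cell[i] = cell[i + 1];
        cell[CELLS - 1] = INT_MAX;
        return min;
    }

    int main(void)
    {
        pq_init();
        pq_insert(4); pq_insert(5); pq_insert(8);
        pq_insert(7);                        /* the trace's insertion   */
        int first  = pq_extract();           /* 4: first extraction     */
        int second = pq_extract();           /* 5: second extraction    */
        printf("%d %d\n", first, second);
        return 0;
    }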
