Computer Architecture: Shared Memory MIMD Architectures
Ola Flygt, Växjö University
http://w3.msi.vxu.se/users/ofl/ | [email protected] | +46 470 70 86 49


TRANSCRIPT

Page 1: Computer Architecture Shared Memory MIMD Architectures

Computer Architecture
Shared Memory MIMD Architectures

Ola Flygt
Växjö University
http://w3.msi.vxu.se/users/ofl/
[email protected]
+46 470 70 86 49

Page 2: Computer Architecture Shared Memory MIMD Architectures

Outline
- Multiprocessors
- Cache memories
- Interconnection network
  - Shared path
  - Switching networks
  - Arbitration
  - Blocking in multistage networks
  - Combining switches
- Cache coherency
- Synchronization

CH01

Page 3: Computer Architecture Shared Memory MIMD Architectures

Multi-processor: Structure of Shared Memory MIMD Architectures

Page 4: Computer Architecture Shared Memory MIMD Architectures

Multi-processor (shared memory system): Problems

Memory Access Time
- can be a bottleneck even in a single-processor system
Contention for Memory
- two or more processors want to access a location in the same block at the same time (hot spot problem)
Contention for Communication
- processors must share the elements of the Interconnection Network and use them exclusively

Result: long latency, idle processors, a non-scalable system

Page 5: Computer Architecture Shared Memory MIMD Architectures

How to increase scalability

1. To do something with the memory organization
Distributed memory seems to be more efficient: while processors are using their private memory (as is the case when executing a process with good locality), they will not disturb each other.
Problem: it is mostly left to the users to configure the system efficiently.
Let's apply caches and automatic data migration based on the good old principle of locality.

Page 6: Computer Architecture Shared Memory MIMD Architectures

How to increase scalability

2. To apply an efficient Interconnection Network
- Fast (bandwidth)
- Flexible (no unnecessary restriction of multiple concurrent communications)
- Safe (no interference)
- Support for broadcasting and multicasting

3. To do something with processors left idle waiting for memory or communication
- Apply the good old principle of multiprogramming at a lower-level layer: support for thread-level parallelism within a processor.

Page 7: Computer Architecture Shared Memory MIMD Architectures

Memory Organization
Ideas:

1. Cache
Provide each processor with a cache memory, and apply an appropriate automatic data-exchange mechanism between the caches and the main memory.
=> Cache coherence problem

2. Virtual (or Distributed) Shared Memory
Distribute the global memory to the processors. Provide each processor with a private memory, but allow them to access the memory of other processors too, as part of a global address space.
=> NUMA, COMA, CC-NUMA machines

Page 8: Computer Architecture Shared Memory MIMD Architectures

Using Caches

Effects of cache memory:
- reduced latency (shorter average memory access time)
- reduced traffic on the IN
- less chance of waiting for communication or memory

Problem of Cache Coherence

Page 9: Computer Architecture Shared Memory MIMD Architectures

Typical Cache Organization

Page 10: Computer Architecture Shared Memory MIMD Architectures

Design space and classification of shared memory computers

Page 11: Computer Architecture Shared Memory MIMD Architectures

Dynamic interconnection networks

Enable the temporary connection of any two components of a multiprocessor.

There are two main classes according to their working mode:
- Shared path networks
- Switching networks

Page 12: Computer Architecture Shared Memory MIMD Architectures

Shared path networks

Networks that provide a continuous connection among the processors and memory blocks.
This was typically a single bus in first-generation multiprocessors. In recent third-generation machines, hierarchical bus systems have been introduced.

Drawback:
- they can support only a limited number of processors (bus connection)

Page 13: Computer Architecture Shared Memory MIMD Architectures

Switching networks

Do not provide a continuous connection among the processors and memory blocks; rather, a switching mechanism enables processors to be temporarily connected to memory blocks.

Drawback:
- too expensive

Page 14: Computer Architecture Shared Memory MIMD Architectures

Shared path networks: Single shared bus

Advantages:
- Its organisation is simply a generalisation and extension of the buses employed in uniprocessor systems.
- It contains the same bus lines (address, data, control, interrupt) as uniprocessors, plus some additional ones to resolve contention when several processors simultaneously want to use the shared bus. These lines are called arbitration lines.
- It is a very cost-effective interconnection scheme.

Drawback:
- Contention on the shared bus strongly limits the number of applicable processors.

Page 15: Computer Architecture Shared Memory MIMD Architectures

Shared path networks: Single shared bus

The typical structure of a single-bus-based multiprocessor without coherent caches

Page 16: Computer Architecture Shared Memory MIMD Architectures

Comparison of write latencies of various buses

Page 17: Computer Architecture Shared Memory MIMD Architectures

Comparison of read latencies of various buses

Page 18: Computer Architecture Shared Memory MIMD Architectures

Arbiter logics

Arbiters play a crucial role in the implementation of pended and split-transaction buses. These are so-called 1-of-N arbiters, since they grant the requested resource (the shared bus) to only one of the requesters.

Page 19: Computer Architecture Shared Memory MIMD Architectures

Design Space for Arbiter logics

Page 20: Computer Architecture Shared Memory MIMD Architectures

Centralized arbitration with independent requests and grants

Page 21: Computer Architecture Shared Memory MIMD Architectures

Daisy-chained bus arbitration scheme

centralised version with fixed priority policy

Page 22: Computer Architecture Shared Memory MIMD Architectures

Structure of a decentralized rotating arbiter with independent requests and grants

The priority loop of the rotating arbiter works similarly to the grant chain of the daisy-chained arbiter.
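The round-robin priority of a rotating 1-of-N arbiter can be modelled in a few lines. This is a minimal, illustrative Python sketch (the function name and encoding are my own, not the hardware logic): the grant rotates so that the most recently served requester gets the lowest priority.

```python
def rotating_arbiter(requests, last_granted):
    """Grant the resource to the first requester found when scanning
    circularly, starting just after the previously granted one
    (round-robin / rotating priority)."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None  # no requests pending
```

Repeatedly feeding the returned grant back in as `last_granted` makes the priority loop rotate, so no requester can be starved the way a fixed-priority daisy chain can starve its tail.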

Page 23: Computer Architecture Shared Memory MIMD Architectures

Multiple shared bus

Problem: the limited bandwidth of the single shared bus
Solution: multiply the number of buses, similarly to the processors and memory units.

Four different ways:
1. 1-dimension multiple bus system
2. 2- or 3-dimension bus systems
3. cluster bus system
4. hierarchical bus system

Page 24: Computer Architecture Shared Memory MIMD Architectures

1-dimension multiple bus system

Page 25: Computer Architecture Shared Memory MIMD Architectures

The arbitration in 1-dimension multiple bus systems

The arbitration is a two-stage process:
1. The 1-of-N arbiters (one per memory unit) resolve the conflict when several processors require exclusive access to the same shared memory unit. After this first stage, m (out of n) processors can obtain access to one of the memory units.
2. When the number of buses (b) is less than that of the memory units (m), a second stage of arbitration is needed, where an additional b-of-m arbiter allocates buses to those processors that successfully obtained access to a memory unit.
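The two stages can be sketched in software. This is an illustrative Python model only (the request encoding and fixed-priority tie-breaking are my assumptions, real arbiters are combinational hardware): stage 1 picks one processor per memory unit, stage 2 allocates the limited buses among the stage-1 winners.

```python
def two_stage_arbitration(requests, n_buses):
    """requests: list of (processor, memory_unit) pairs.
    Stage 1: a 1-of-N arbiter per memory unit picks one winner
    (fixed priority: first requester in list order wins).
    Stage 2: a b-of-m arbiter grants buses to at most n_buses
    of the winning (memory_unit, processor) pairs."""
    winners = {}
    for proc, mem in requests:          # stage 1: per-memory arbitration
        if mem not in winners:
            winners[mem] = proc
    # stage 2: only b buses exist, so only b winners proceed this cycle
    granted = list(winners.items())[:n_buses]
    return granted
```

Losers of either stage simply retry in the next bus cycle, exactly as in the hardware scheme.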

Page 26: Computer Architecture Shared Memory MIMD Architectures

Cluster bus system

Page 27: Computer Architecture Shared Memory MIMD Architectures

Switching networksCrossbar

Page 28: Computer Architecture Shared Memory MIMD Architectures

Switching networks: Crossbar

Advantages:
- the most powerful network type
- it provides simultaneous access among all the inputs and outputs of the network, provided that all the requested outputs are different

Drawbacks:
- enormous price, due to the large number of individual switches (one associated with each pair of inputs and outputs of the network)
- the wiring and the logic complexity increase

Page 29: Computer Architecture Shared Memory MIMD Architectures

Switching networks: Crossbar

Detailed structure of a crossbar network

All the switches should contain:
- an arbiter logic to allocate the memory block in the case of conflicting requests
- a multiplexer module to enable the connection between the buses of the winning processor and the memory buses

Page 30: Computer Architecture Shared Memory MIMD Architectures

Multistage networks

This is a compromise between the single bus and the crossbar switch interconnections (from the point of view of implementation complexity, cost, connectivity, and bandwidth)

A multistage network consists of alternating stages of links and switches.

They can be categorised based on the number of stages, the number of switches at a stage, the topology of links connecting subsequent stages, and the type of switches employed at the stages

Page 31: Computer Architecture Shared Memory MIMD Architectures

The complete design space of multistage networks

Page 32: Computer Architecture Shared Memory MIMD Architectures

Multistage networks: Omega network

This is the simplest multistage network:
- It has log2 N stages with N/2 switches at each stage.
- All the switches have two input and two output links.
- Any single input can be connected to any output.
- Four different switch positions: upper broadcast, lower broadcast, straight through, switch

Page 33: Computer Architecture Shared Memory MIMD Architectures

Multistage networks: Omega network

Page 34: Computer Architecture Shared Memory MIMD Architectures

Multistage networks: Omega network

The state of the switches when P2 sends a broadcast message

Page 35: Computer Architecture Shared Memory MIMD Architectures

Blocking network

Any output can be accessed from any input by setting the switches, but simultaneous access to all the outputs from different inputs is not always possible.
The possible sets of transformations mapping all inputs to different outputs are called permutations.
In blocking networks there are permutations that cannot be realised by any setting of the switches.

Page 36: Computer Architecture Shared Memory MIMD Architectures

Blocking in an Omega network

No matter how the other inputs are mapped to the outputs, a conflict appears at switch A, resulting in the blocking of either the 0->5 or the 6->4 message.
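This blocking can be reproduced with the Omega network's destination-tag routing rule. The sketch below is an illustrative Python model (function name and path encoding are my own): at each stage the address is perfect-shuffled (rotated left), then the switch sets the low bit from the next destination bit (0 = upper output, 1 = lower output). Two messages conflict if they occupy the same position at the same stage.

```python
def omega_route(src, dst, n_bits):
    """Trace a message through an Omega network with 2**n_bits inputs
    using destination-tag routing; returns the position after each
    stage (the links the message uses)."""
    mask = (1 << n_bits) - 1
    pos = src
    path = [pos]
    for stage in range(n_bits):
        # perfect shuffle: rotate the n_bits-wide address left by one
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # the switch routes by the next destination bit, MSB first
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        path.append(pos)
    return path
```

For the 8-input example in the slide, the 0->5 and 6->4 routes both need position 2 after the second stage, i.e. the same output link of the same switch, so one of the two messages must be blocked.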

Page 37: Computer Architecture Shared Memory MIMD Architectures

Blocking and nonblocking networks

Blocking networks (multistage networks):
- Simultaneous access to all the outputs from different inputs is not always possible.
- The parallel access mechanism can be improved by additional stages that introduce redundant paths into the interconnection scheme (Benes network) => a rearrangeable nonblocking network, at the price of increased size, latency, and cost of the network.
- Multistage networks were quite popular in early large-scale shared memory systems (for example: NYU Ultracomputer, CEDAR, HEP, etc.)

Page 38: Computer Architecture Shared Memory MIMD Architectures

Blocking and nonblocking networks

Nonblocking networks (crossbar interconnection):
- Any simultaneous input-output combination is possible.

Page 39: Computer Architecture Shared Memory MIMD Architectures

Three stage Clos network

Page 40: Computer Architecture Shared Memory MIMD Architectures

Three stage Benes network

Page 41: Computer Architecture Shared Memory MIMD Architectures

8 x 8 baseline network

Page 42: Computer Architecture Shared Memory MIMD Architectures

Shuffle Exchange network

Page 43: Computer Architecture Shared Memory MIMD Architectures

Delta network

Page 44: Computer Architecture Shared Memory MIMD Architectures

Generalized Shuffle network stage

Page 45: Computer Architecture Shared Memory MIMD Architectures

Extra stage Delta network

Page 46: Computer Architecture Shared Memory MIMD Architectures

The summary of properties of multistage networks

Page 47: Computer Architecture Shared Memory MIMD Architectures

Techniques to avoid hot spots

In multistage-network-based shared memory systems, hundreds of processors can compete for the same memory location. This place in memory is called a hot spot.

Problem: two messages enter at two different inputs of a switch but want to exit at the same output.

Solutions:
- queuing networks: these temporarily hold the second message in the switch, applying a queue store able to accommodate a small number of messages
- nonqueuing networks: these reject the second message, so that unsuccessful messages retreat and leave the network free

Page 48: Computer Architecture Shared Memory MIMD Architectures

Hot spot saturation in a blocking Omega network

Page 49: Computer Architecture Shared Memory MIMD Architectures

Asymptotic bandwidth in the presence of a hot spot

Page 50: Computer Architecture Shared Memory MIMD Architectures

Techniques to avoid hot spots

Solutions (cont.):
- combining networks: these are able to recognise that two messages are directed to the same memory module, and in such cases they can combine the two messages into a single one

This technique is particularly advantageous in the implementation of synchronisation tools like semaphores and barriers, which are frequently accessed by many processes running on distinct processors.

Page 51: Computer Architecture Shared Memory MIMD Architectures

Structure of a combining switch

This structure was used in the NYU Ultracomputer (shown on the next slide).

If the two requests refer to the same memory address, the corresponding combining queue:
- forwards one request to the memory block
- places the second request in the associated wait buffer
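Combining is easiest to see with fetch&add, the operation the NYU Ultracomputer combined in its switches. The sketch below is an illustrative Python model (the function and its interface are my own simplification of the hardware): two fetch&add requests to the same address become one memory access, and the switch reconstructs both replies from the single returned value.

```python
def combine_fetch_and_add(mem, addr, inc1, inc2):
    """Model of a combining switch merging two fetch&add requests to
    the same address: one combined request (inc1 + inc2) goes to
    memory; the old value answers the first requester, and
    old + inc1 is handed to the second requester, whose original
    request waited in the switch's wait buffer."""
    old = mem[addr]
    mem[addr] = old + inc1 + inc2   # a single combined memory access
    return old, old + inc1          # replies for requester 1 and 2
```

Both processors observe the same result as if their fetch&adds had executed one after the other, yet the hot-spot memory module saw only one access.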

Page 52: Computer Architecture Shared Memory MIMD Architectures

Structure of a combining switch

Page 53: Computer Architecture Shared Memory MIMD Architectures

Fetch-and-add operations in a multistage network

Page 54: Computer Architecture Shared Memory MIMD Architectures

Cache Coherence: Cache coherence problems

Cache memories are introduced into computers in order to bring data closer to the processor. In multiprocessor machines, where several processors require a copy of the same memory block, maintaining consistency among these copies raises the so-called cache coherence problem, which can arise for three reasons:
1. Sharing of writable data
2. Process migration
3. I/O activity

Page 55: Computer Architecture Shared Memory MIMD Architectures

Cache Coherence data structures

Types of data, causing less or more of a problem with coherence:
- read-only: no change at run time (program code, constants); no change - no problem with coherence
- private writable/readable: used by a single process (local variables, process state variables); a problem only in the case of process migration
- private writable / shared readable: a single process manages all changes, but more processes read the result; problematic
- shared writable/readable: used (and written) by more processes (global variables); the most problematic

These types of data can be separated with compiler (and/or user) assistance.

Page 56: Computer Architecture Shared Memory MIMD Architectures

Cache Coherence
Levels of solution:
- HW-based protocol for all data categories - a total but complex solution
- SW-based solutions with some HW support and restrictions - a compromise
- Shared writable data are not cached - a compromise, no solution for critical situations

Page 57: Computer Architecture Shared Memory MIMD Architectures

HW-Based Cache Coherence Protocols

We discuss hardware-based protocols from three points of view:
- How they keep coherence between the updated local copies and the main memory: memory update policy
- How they keep coherence among several local copies: cache coherence policy
- How they work in detail (algorithm and data structures): protocol type, which is determined mainly by the interconnection network

Page 58: Computer Architecture Shared Memory MIMD Architectures

Design space for HW-Based Cache Coherence Protocols

Page 59: Computer Architecture Shared Memory MIMD Architectures

Memory Update Policies
write-through (a greedy policy):

As data is updated in one of the local caches, its copy in the main memory is immediately updated, too.
- unnecessary traffic on the interconnection in the case of private data and of infrequently used shared data
+ more reliable (error detection and recovery features of the main memory)

Page 60: Computer Architecture Shared Memory MIMD Architectures

Memory Update Policies
write-back (a lazy policy):

Data in memory is updated only at certain events (e.g. when the data is replaced or invalidated in the cache). This allows a temporary incoherence between caches and memory; while not yet updated, read references to memory are redirected to the appropriate cache.
- more complex cache controllers
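The difference between the two policies comes down to when memory sees the write. The following Python sketch is purely illustrative (the class and function names are my own): write-through updates memory on every write, while write-back only sets a dirty bit and defers the memory update to a later flush.

```python
class CacheLine:
    """One cache line; the dirty bit is only meaningful for write-back."""
    def __init__(self, addr, value):
        self.addr, self.value, self.dirty = addr, value, False

def write(line, memory, value, policy):
    """Update the line under the given memory update policy."""
    line.value = value
    if policy == "write-through":
        memory[line.addr] = value   # greedy: memory updated immediately
    else:                           # write-back
        line.dirty = True           # lazy: memory update is deferred

def flush(line, memory):
    """An event (replacement, invalidation) that forces the write-back."""
    if line.dirty:
        memory[line.addr] = line.value
        line.dirty = False
```

Between the write and the flush, a write-back system must redirect read references for the block to the owning cache, which is exactly the extra complexity noted above.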

Page 61: Computer Architecture Shared Memory MIMD Architectures

Memory Update Policies

Page 62: Computer Architecture Shared Memory MIMD Architectures

Cache Coherence Policies
write-update (a greedy policy):

As data is updated in one of the local caches, its copies in other caches are immediately updated. The copy in the main memory may or may not be updated.
- immediate data migration, with unnecessary traffic in the case of private data and of infrequently used shared data
- cache controllers have to accept requests not only from their own processor, but also from other cache controllers

Page 63: Computer Architecture Shared Memory MIMD Architectures

Cache Coherence Policies
write-invalidate (a lazy policy):

As data is updated in one of the local caches, all other copies in other caches and in the main memory are immediately invalidated. While not yet updated, the data is provided by the updating processor's cache for the read operations of other processors.
- cache controllers have to accept invalidate commands from other cache controllers

Page 64: Computer Architecture Shared Memory MIMD Architectures

Cache Coherence Policies

Page 65: Computer Architecture Shared Memory MIMD Architectures

HW cache protocol types

- snoopy cache protocol (used mainly in single bus interconnections): see the coming slides
- hierarchical cache protocol (used in hierarchical bus interconnections): following the hierarchical structure, starting from the bottom level, we can place a 'supercache' at each segment of the bus; this supercache serves as the connection to the higher-level bus
- directory schemes (used in general interconnections): the updating processor multicasts coherence commands exactly to those caches holding a copy of the data; several directory schemes exist

Page 66: Computer Architecture Shared Memory MIMD Architectures

Snoopy cache protocols
- used mainly in single bus interconnections
- both updating and invalidating versions are used
- the updating processor broadcasts the update or invalidate command to all other caches
- cache controllers 'snoop' on the bus for coherence commands, and update or invalidate their cached blocks if necessary

Page 67: Computer Architecture Shared Memory MIMD Architectures

Snoopy cache protocols: Basic solution

Memory is always up to date: write-through and write-invalidate.
Situations at references of the local processor:
- Read hit: use the copy from the local cache (no bus cycle)
- Read miss: fetch from memory (cache replacement policy) (bus cycle)
- Write hit: invalidate other caches, update cache and memory (bus cycle)
- Write miss: fetch from memory (cache replacement policy), invalidate other caches, update cache and memory (bus cycle)
- Replacement: find a free block or choose a victim, nothing to do with the old content, load the new block from memory
At bus cycles of other processors:
- On recognizing a write cycle on the bus - executed by another processor - to a block of which the local cache holds a valid copy, the local copy of the block should be invalidated.
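The write-hit case of this basic protocol can be sketched in a few lines. This is an illustrative Python model only (data structures and names are mine, not a real controller): the writer updates its cache and memory in one bus cycle, and every other controller snooping that cycle drops its copy of the block.

```python
def snoopy_write(caches, memory, writer, addr, value):
    """Write hit under write-through + write-invalidate: each cache is
    modelled as a dict of cached blocks; a missing key means the block
    is not (validly) cached."""
    caches[writer][addr] = value
    memory[addr] = value                # write-through: memory stays current
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(addr, None)       # snooped bus cycle -> invalidate copy
```

A subsequent read by another processor misses and refetches from memory, which is guaranteed up to date by the write-through policy.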

Page 68: Computer Architecture Shared Memory MIMD Architectures

State transition graph

The state transition diagram defines how the cache controller should work when a request is issued by the associated processor or by other caches through the bus.

For example, when a BusRd command arrives at a cache block in state Modified, the cache controller should change the state of the block to Shared modified.

E = Exclusive state
M = Modified state
Sc = Shared clean state
Sm = Shared modified state

Page 69: Computer Architecture Shared Memory MIMD Architectures

Structure of the snoopy cache controller

Page 70: Computer Architecture Shared Memory MIMD Architectures

Software-based Cache Coherence protocols

Software-based approaches represent a good and competitive compromise, since they require nearly negligible hardware support and can lead to the same small number of invalidation misses as the hardware-based protocols.

All software-based protocols rely on compiler assistance.

Page 71: Computer Architecture Shared Memory MIMD Architectures

Software-based Cache Coherence protocols

(cont.)

The compiler analyses the program and classifies the variables, according to their use, into one of four classes:
1. Read-only
2. Read-only for any number of processes and read-write for one process
3. Read-write for one process
4. Read-write for any number of processes

Page 72: Computer Architecture Shared Memory MIMD Architectures

Software-based Cache Coherence protocols (cont.)

- Read-only variables can be cached without restrictions.
- Type 2 variables can be cached only for the processor where the read-write process runs.
- Since only one process uses type 3 variables, it is sufficient to cache them only for that process.
- Type 4 variables must not be cached in software-based schemes.

Page 73: Computer Architecture Shared Memory MIMD Architectures

Software-based Cache Coherence protocols

(cont.)

Variables demonstrate different behaviour in different program sections, and hence the program is usually divided into sections by the compiler, with the variables categorized independently in each section. For example, a parallel for-loop is a typical program section.

Typically, at the end of each program section the caches must be invalidated to ensure a consistent state of the variables before starting a new section. According to the way the invalidation is realized, two main schemes can be distinguished: indiscriminate invalidation and selective invalidation. These can in turn be further divided into subcategories.

Page 74: Computer Architecture Shared Memory MIMD Architectures

The design space for Software-based protocols

Page 75: Computer Architecture Shared Memory MIMD Architectures

Synchronization in multiprocessors

Mutual exclusion and other synchronisation problems can be solved by high-level synchronisation language constructs like semaphores, conditional critical regions, monitors, etc.

All of these high-level schemes are based on some low-level synchronisation tools realised or supported by hardware.

In cache coherent architectures, the atomic test&set operation is usually replaced with a cached test-and-test&set scheme.

Page 76: Computer Architecture Shared Memory MIMD Architectures

Synchronization in multiprocessors

Requirements for the test-and-test&set scheme:
- minimum amount of traffic generated while waiting
- low-latency release of a waiting processor
- low-latency acquisition of a free lock

These schemes are moderately successful in small cache-based systems like shared-bus-based multiprocessors, but usually fail in scalable multiprocessors where high-contention locks are frequent.

Page 77: Computer Architecture Shared Memory MIMD Architectures

Synchronization in multiprocessors: Simple test&set

One of the main problems of implementing synchronization schemes in cache coherent architectures is deciding what happens if the test&set operation fails, i.e. the lock was in the state CLOSED.

Obviously, as the definition of the test&set operation shows, the processor should repeat the operation as long as the lock is CLOSED.

This is a form of busy waiting, which ties up the processor in an idle loop and increases the shared bus traffic and contention.

This type of lock, relying on busy waiting, is called a spin-lock and is considered a significant cause of performance degradation when a large number of processes use it simultaneously.

Page 78: Computer Architecture Shared Memory MIMD Architectures

Synchronization in multiprocessors

The problem of thrashing:
- Two processors are trying to close a lock
- Both are waiting for the other
- Each time they try the lock (busy-waiting), they cause a cache miss, with the subsequent handling of that miss

The effect is that both processors and the bus are busy dealing with cache misses.

Page 79: Computer Architecture Shared Memory MIMD Architectures

Synchronization in multiprocessors

Alternatives to spin-locks:
- snooping lock (requires hardware support)
- test-and-test&set
- test-and-test&set collision avoidance locks
- tournament locks
- queue locks

Page 80: Computer Architecture Shared Memory MIMD Architectures

Event ordering in cache coherent systems

In order to understand what correct parallel program execution means in a cache coherent multiprocessor environment, it should be discussed what requirements a correct solution must satisfy.

The generally accepted requirement is sequential consistency:
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by its program."

In other words, a parallel program execution is sequentially consistent if any of its executions is equivalent to an interleaved execution on a uniprocessor system.

Page 81: Computer Architecture Shared Memory MIMD Architectures

Event ordering in cache coherent systems

A necessary and sufficient condition for a system with atomic memory accesses to be sequentially consistent is that memory accesses be performed in program order.

Systems for which such a condition holds are called strongly ordered systems.

A memory access is atomic if its effect is observable by each processor of the parallel computer at the same time.

It can be shown that memory accesses in parallel systems without caches are always atomic, and hence for them it is sufficient to be strongly ordered to maintain sequential consistency.

Page 82: Computer Architecture Shared Memory MIMD Architectures

Event ordering in cache coherent systems

For simple bus systems this can easily be satisfied.

For other systems a relaxed consistency model is required. Alternatives are:
- processor consistency
- weak consistency model
- release consistency model

Page 83: Computer Architecture Shared Memory MIMD Architectures

Design space of single bus based multiprocessors

Page 84: Computer Architecture Shared Memory MIMD Architectures

The convergence of scalable MIMD computers