SMD149 - Operating Systems - Multiprocessing

Roland Parviainen

December 1, 2005

Overview

Multiprocessor systems

Multiprocessor, operating system and memory organizations

Introduction

Multiprocessor system

Computer that contains more than one processor

Benefits

- Increased processing power
- Scale resource use to application requirements

Additional operating system responsibilities

- All processors remain busy
- Even distribution of processes throughout the system
- All processors work on consistent copies of shared data
- Execution of related processes synchronized
- Mutual exclusion enforced

Multiprocessing architecture

Examples of multiprocessors

- Dual-processor personal computer
- Powerful server containing many processors
- Cluster of workstations

Classifications of multiprocessor architecture

- Nature of datapath
- Interconnection scheme
- How processors share resources

Flynn’s taxonomy

Based upon the number of concurrent instruction (or control) and data streams available

- SISD: Single instruction, single data stream
- MISD: Multiple instruction, single data stream
- SIMD: Single instruction, multiple data streams
- MIMD: Multiple instruction, multiple data streams

Single instruction, single data stream

Sequential computer

Exploit minor parallelism:

- Pipelining
- Superscalar processors
- Hyper-Threading

Multiple instruction, single data stream

Not commonly used

Redundancy

Three processors do the same computation, results are compared
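
The redundancy scheme above can be sketched as a majority vote over the results of several units that process the same data. This is only an illustration: the names `vote`, `redundant_compute`, `healthy` and `faulty` are hypothetical, and the "fault" is simulated.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority of the redundant units."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among redundant results")
    return value


def redundant_compute(units, data):
    """Run every unit on the same data and compare the results."""
    return vote([unit(data) for unit in units])


def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1  # simulated hardware fault


print(redundant_compute([healthy, healthy, faulty], 7))  # 49
```

With three units, one faulty result is outvoted; if all three disagree, the sketch reports an error rather than guessing.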

Single instruction, multiple data streams

One or more processing units

Same instruction on a block of data

Array or vector processors

Most processors have SIMD instructions

- PowerPC’s AltiVec, Intel’s MMX, SSE, SSE2 and SSE3, AMD’s 3DNow!, SPARC’s VIS, PA-RISC’s MAX, and the MIPS MDMX and MIPS-3D
- Multimedia

Multiple instruction, multiple data streams

Multiple autonomous processors simultaneously executing different instructions on different data

Multiprocessor systems

Network of workstations

Distributed systems

Processor interconnection schemes

Describes how the system’s components, such as processors and memory modules, are connected

Consists of nodes (components or switches) and links (connections)

Parameters used to evaluate interconnection schemes

- Node degree
- Bisection width
- Network diameter
- Cost of the interconnection scheme

Shared bus

Single communication path between all nodes

Contention can build up for shared bus

Lacks scalability

Useful for interconnecting a small number of processors

For systems with many processors, a bus could be used to interconnect nodes within a supernode

Dynamic network

Example: dual-processor Intel Pentium

Crossbar-switch matrix

Separate path from every processor to every memory module (or in general, from every node to every other node)

For n processors and m memory modules, there will be n x m total switches

High fault tolerance:

Bisection width = (n x m)/2

High performance:

Simultaneous transmissions can be supported

Network diameter = 1 yields low latency

High cost: proportional to n x m

Example: Sun UltraSPARC-III

2-D mesh network

n rows and m columns, in which a node is connected to nodes directly north, south, east and west of it

Maximum node degree = 4

Moderate performance, fault tolerance and cost

Not very scalable given that network diameter may be substantial

Example: Intel Paragon (large-scale system, but communication is kept between neighboring nodes)

Hypercube

n-dimensional hypercube has 2^n nodes in which each node is connected to n neighbor nodes

Better scalability and performance

4-D hypercube (16 nodes) results in network diameter of 4, while a 2-D mesh network with 16 nodes results in network diameter of 6

More fault tolerant than a 2-D mesh network, but more expensive because it needs more links

Example: nCUBE (up to 8192 processors)
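
The diameter figures quoted above can be checked with a short breadth-first search. The sketch below builds both topologies as adjacency maps and measures network diameter and maximum node degree; the function names are illustrative, not taken from any library.

```python
from collections import deque
from itertools import product


def hypercube(n):
    """Adjacency of an n-dimensional hypercube: neighbors differ in one bit."""
    return {v: [v ^ (1 << b) for b in range(n)] for v in range(2 ** n)}


def mesh(rows, cols):
    """Adjacency of a 2-D mesh: links to the north, south, east and west."""
    adj = {}
    for r, c in product(range(rows), range(cols)):
        steps = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        adj[(r, c)] = [(i, j) for i, j in steps if 0 <= i < rows and 0 <= j < cols]
    return adj


def eccentricity(adj, start):
    """Longest shortest-path distance from start, found by breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())


def diameter(adj):
    """Network diameter: the largest distance between any two nodes."""
    return max(eccentricity(adj, node) for node in adj)


def max_degree(adj):
    return max(len(neighbors) for neighbors in adj.values())


print(diameter(hypercube(4)), max_degree(hypercube(4)))  # 4 4
print(diameter(mesh(4, 4)), max_degree(mesh(4, 4)))      # 6 4
```

Both 16-node networks have a maximum node degree of 4, but the hypercube reaches any node in at most 4 hops while the mesh may need 6.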

Multistage networks

Switch nodes act as hubs routing messages between nodes

Cheaper (use simple hardware for switches that can be packed tightly)

Less fault tolerant and worse performance compared to a crossbar-switch matrix

Contention can occur at switches and network diameter is typically larger

Example: IBM POWER4

Loosely coupled vs. tightly coupled systems

Tightly coupled systems

- Processors share most resources including memory
- Communicate over shared buses
- Less burden for operating system programmers
- Less flexible, poor scalability and fault tolerance

Loosely coupled systems

- Processors do not share most resources
- Dedicated memory
- Explicit messages using communication link or shared virtual memory
- More flexible, fault tolerant, scalable

Multiprocessor operating system organizations

Significantly different from uniprocessor systems

Systems classified into three types (based on how processors share operating system responsibilities)

- Master/slave
- Separate kernels
- Symmetrical organization

Master/slave

Master processor executes operating system code

Computation and I/O jobs

Slaves execute only user programs

Processor-bound jobs

Hardware asymmetry

Slave processors have to wait for the master processor to handle the interrupt for OS operations

Low fault tolerance

Failure of the master processor results in system failure

Good for computationally intensive jobs

Example: nCUBE system

Separate kernels

Each processor executes its own operating system and responds only to interrupts generated from user processes running on that processor

Loosely coupled

- Little contention over resources
- Some globally shared operating system data

System failure unlikely, but failure of one processor results in termination of processes on that processor

Useful in environments where processes do not interact much and high availability is important

Symmetrical

Operating system manages a pool of identical processors

- High amount of resource sharing
- Need for mutual exclusion
- Allows processes to truly share the processors and exploit parallelism

Highest degree of fault tolerance of any organization

Facilitates graceful degradation

Some contention for resources can limit performance improvement gained by adding more processors
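
The need for mutual exclusion in a symmetrical organization can be illustrated with a shared ready queue that several "processors" (threads here) pull work from. This is a minimal sketch, not kernel code: `ReadyQueue`, `processor` and `run` are hypothetical names, and Python threads merely stand in for identical processors.

```python
import threading


class ReadyQueue:
    """Hypothetical operating system structure shared by all processors."""

    def __init__(self, jobs):
        self._jobs = list(jobs)
        self._lock = threading.Lock()

    def take(self):
        # Mutual exclusion: only one processor may modify the queue at a time.
        with self._lock:
            return self._jobs.pop() if self._jobs else None


def processor(queue, done):
    """Any processor may take any job; each records what it ran in its own list."""
    while True:
        job = queue.take()
        if job is None:
            return
        done.append(job)


def run(num_processors, jobs):
    queue = ReadyQueue(jobs)
    per_cpu = [[] for _ in range(num_processors)]
    threads = [threading.Thread(target=processor, args=(queue, done))
               for done in per_cpu]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return per_cpu


per_cpu = run(4, range(1000))
print(sum(len(done) for done in per_cpu))  # 1000: every job ran exactly once
```

The single lock is also where contention appears: every processor must pass through it, which is one reason adding processors does not scale performance linearly.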

Memory access architectures

Classification based on how memory is shared

Trade-offs between performance, scalability and cost

Uniform memory access (UMA) multiprocessor system

All processors share main memory

Access time to any memory page is nearly the same for all processors and all memory modules

Typically uses shared bus or crossbar-switch matrix

Used in small multiprocessor systems (typically two to eight processors)

Nonuniform memory access (NUMA) multiprocessor

Global memory divided into memory partitions or modules

Each node contains a few processors and a portion of system memory, which is local to that node

Access to local memory faster than access to global memory (rest of the memory modules)

More scalable than UMA with fewer bus collisions

Most of a processor’s memory requests served by local memory
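
Why the local hit rate matters can be seen from a simple weighted average of local and remote latencies. The sketch below assumes a two-level model (local versus remote); the latency figures are made-up illustration values, not measurements of any real machine.

```python
def mean_access_time(local_fraction, local_ns, remote_ns):
    """Average memory access time when a fraction of requests is served locally."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies: 100 ns for local memory, 400 ns for a remote module.
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(mean_access_time(fraction, 100, 400), 1))
```

Under these assumed numbers, raising the local fraction from 50% to 90% roughly halves the average access time, which is the motivation for the page placement strategies discussed later.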

Cache-only memory architecture (COMA) multiprocessor

Basic physical interconnection similar to NUMA

Main memory is viewed as a cache and called an attraction memory (AM)

- Allows system to migrate data to node that most often accesses it at granularity of a memory line (more efficient than a memory page)
- Reduces the number of cache misses serviced remotely
- Overhead: duplicated data items and complex protocol to ensure all updates are received at all processors

No-remote-memory-access (NORMA) multiprocessor

Loosely coupled

Simple to build: no complex interconnection

No shared global memory

Each node maintains its own local memory

Communication through explicit messages

Some implement the illusion of shared global memory known as shared virtual memory (SVM)

Distributed systems

Involve message passing via remote procedure calls (RPCs)

Example: Google server farms

Memory sharing

Memory coherence

- Ensure that all the processors have access to the most up-to-date copy of data
- Cache coherency

Strategies to increase percentage of local memory hits

- Page migration
- Page replication

Cache coherence

UMA cache coherence

Three different ways to implement

- Bus snooping (a.k.a. cache snooping)
- Maintain a centralized directory to indicate when to remove stale data
- Only one processor can cache a data item

Cache-coherent NUMA

- Each address is associated with a home node
- Contact home node on reads and writes to receive most updated version of data
- The protocol requires a maximum of three network traversals per coherency operation
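
The three-traversal bound of the home-node protocol can be traced with a small model. The sketch below is a simplification under stated assumptions: the `Directory` class is hypothetical, addresses are mapped to home nodes by modulo, and invalidation of shared read copies is ignored.

```python
class Directory:
    """Home-node directory for a cache-coherent NUMA system (simplified)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.owner = {}  # address -> node holding the most recent copy

    def home(self, address):
        # Assumption: addresses are interleaved across nodes by modulo.
        return address % self.num_nodes

    def read(self, requester, address):
        """Return the network messages needed for requester to read address."""
        home = self.home(address)
        owner = self.owner.get(address, home)
        messages = []
        if owner == requester:
            return messages  # the requester already holds the latest copy
        if requester != home:
            messages.append((requester, home, "request"))
        if owner != home:
            messages.append((home, owner, "forward"))
        messages.append((owner, requester, "data"))
        return messages

    def write(self, requester, address):
        """Fetch the latest copy, then record requester as the new owner."""
        messages = self.read(requester, address)
        self.owner[address] = requester
        return messages


d = Directory(num_nodes=4)
print(len(d.write(2, 5)))  # 2: request to home node 1, data back to node 2
print(len(d.read(3, 5)))   # 3: request to home, forward to owner, data to node 3
print(len(d.read(2, 5)))   # 0: node 2 already holds the latest copy
```

The worst case occurs when requester, home node and current owner are three different nodes, which gives exactly three traversals.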

Page replication and migration

Page replication

- System places a copy of a memory page at a new node
- Good for pages read by processes at different nodes, but not written

Page migration

- System moves a memory page from one node to another
- Good for pages written to by a single remote process

Increases percentage of local memory hits

Drawbacks

- Memory overhead
- Replicating and migrating are more expensive than remotely referencing, so they are only useful if the page is referenced again
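
One way to turn the rules of thumb above into a decision procedure is to look at per-node access counts for a page. The policy below is an assumed, deliberately simple one for illustration (real systems also weigh the copying cost against expected future references); `placement_decision` is a hypothetical name.

```python
def placement_decision(reads_by_node, writes_by_node, home_node):
    """Choose 'replicate', 'migrate' or 'keep' for one page from access counts."""
    readers = {n for n, count in reads_by_node.items() if count > 0}
    writers = {n for n, count in writes_by_node.items() if count > 0}
    remote_readers = readers - {home_node}

    # Read by several nodes but never written: copies cannot become stale.
    if not writers and remote_readers:
        return ("replicate", sorted(remote_readers))

    # Used only by a single remote node that writes to it: move the page there.
    if len(writers) == 1 and readers <= writers and writers != {home_node}:
        return ("migrate", sorted(writers))

    return ("keep", [home_node])


print(placement_decision({0: 5, 1: 3, 2: 4}, {}, home_node=0))
print(placement_decision({}, {2: 7}, home_node=0))
print(placement_decision({1: 2}, {0: 1, 1: 1}, home_node=0))
```

The three calls produce a replicate, a migrate and a keep decision respectively; the last page is written by two nodes, so neither strategy applies under this policy.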

Scheduling

Determines the order in which processes are dispatched and to which processors

Two goals

- Parallelism - timesharing scheduling
- Processor affinity - space-partitioning scheduling
  - Soft affinity
  - Hard affinity

Types of scheduling algorithms

- Job-blind
- Job-aware

Job-blind multiprocessor scheduling

Straightforward extension of uniprocessor algorithms

Examples

- First-in-first-out (FIFO) multiprocessor scheduling
- Round-robin (RR) multiprocessor scheduling
- Shortest-process-first (SPF) multiprocessor scheduling
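
As an illustration of how a uniprocessor algorithm extends to several processors, the following sketch runs round-robin over a single global run queue. It is an assumed, simplified model (fixed quantum, no I/O, no arrivals); `round_robin` is a hypothetical name.

```python
from collections import deque


def round_robin(processes, num_processors, quantum):
    """Schedule processes (name -> remaining time) from one global run queue.

    Returns one entry per time slice, each a list of (processor, process) pairs.
    """
    queue = deque(processes.items())
    timeline = []
    while queue:
        count = min(num_processors, len(queue))
        running = [queue.popleft() for _ in range(count)]
        timeline.append([(cpu, name) for cpu, (name, _) in enumerate(running)])
        for name, remaining in running:
            if remaining > quantum:
                # Unfinished processes rejoin the tail of the global queue.
                queue.append((name, remaining - quantum))
    return timeline


for time_slice in round_robin({"A": 2, "B": 1, "C": 3}, num_processors=2, quantum=1):
    print(time_slice)
```

In this run process A executes first on processor 0 and then on processor 1: a job-blind scheduler pays no attention to processor affinity.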

Job-aware multiprocessor scheduling

Algorithms that maximize parallelism

- Shortest-number-of-processes-first (SNPF) scheduling
- Round-robin job (RRjob) scheduling
- Coscheduling
  - Processes from the same job placed in adjacent spots in the global run queue
  - Sliding window moves down run queue running adjacent processes that fit into window (size = number of processors)

Algorithms that maximize processor affinity

Dynamic partitioning
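
The sliding window used by coscheduling can be sketched as follows. This is a minimal model under stated assumptions: the run queue is treated as circular, the window advances by its full width each time slice, and processes never finish; `coschedule` is a hypothetical name.

```python
def coschedule(jobs, num_processors, num_slices):
    """Return the processes run in each time slice.

    jobs maps a job name to its number of processes.
    """
    # Processes of the same job occupy adjacent slots in the global run queue.
    run_queue = [f"{job}.{i}" for job, count in jobs.items() for i in range(count)]
    if not run_queue:
        return []
    slices = []
    start = 0
    for _ in range(num_slices):
        width = min(num_processors, len(run_queue))
        # The window covers `width` adjacent slots, wrapping around the queue.
        window = [run_queue[(start + k) % len(run_queue)] for k in range(width)]
        slices.append(window)
        start = (start + width) % len(run_queue)
    return slices


for window in coschedule({"A": 3, "B": 2, "C": 2}, num_processors=4, num_slices=3):
    print(window)
```

In the first slice all of job A runs together, but job B is split across the first and second slices, which shows why window placement matters when jobs do not align with the window size.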

Process migration

Transferring a process from one node to another node

Benefits of process migration

- Increases fault tolerance
- Load balancing
- Reduces communication costs
- Resource sharing

Flow of process migration

Either sender or receiver initiates request

Sender suspends migrating process

Sender creates message queue for migrating process’s messages

Sender transmits state to a “dummy” process at the receiver

Sender and receiver notify other nodes of process’s new location

Sender deletes its instance of the process
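
The steps above can be walked through in a toy simulation in which nodes are plain objects and a process is a dictionary of state. Everything here is hypothetical (`Node`, `migrate`, the field names); forwarding the queued messages to the receiver is an assumption, since the steps only say that the sender creates the queue.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.processes = {}  # pid -> process state (a plain dict here)
        self.location = {}   # pid -> name of the node believed to host it
        self.held = {}       # pid -> messages queued while the process migrates


def migrate(pid, sender, receiver, all_nodes):
    """Move process pid from sender to receiver, following the listed steps."""
    state = sender.processes[pid]
    state["suspended"] = True                  # sender suspends the process
    sender.held[pid] = []                      # queue for messages that arrive meanwhile
    receiver.processes[pid] = {"dummy": True}  # placeholder process at the receiver
    receiver.processes[pid] = dict(state)      # state is transferred into the dummy
    for node in all_nodes:                     # other nodes learn the new location
        node.location[pid] = receiver.name
    pending = sender.held.pop(pid)             # assumed: queued messages are forwarded
    receiver.processes[pid].setdefault("inbox", []).extend(pending)
    del sender.processes[pid]                  # sender deletes its instance
    receiver.processes[pid]["suspended"] = False


a, b, c = Node("a"), Node("b"), Node("c")
a.processes[1] = {"pc": 42}
migrate(1, a, b, [a, b, c])
print(1 in a.processes, b.processes[1]["pc"], c.location[1])  # False 42 b
```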

Process migration concepts

Trade-off between residual dependency and initial migration latency

Characteristics of a successful migration strategy

- Minimal residual dependency
- Transparent
- Scalable
- Fault tolerant
- Heterogeneous

Process migration strategies

Eager migration

Transfer entire state during initial migration

Dirty eager migration

- Only transfer dirty memory pages
- Clean pages brought in from secondary storage

Copy-on-reference migration

Similar to dirty eager except clean pages can also be acquired from sending node on demand

Lazy copying

No initial transfer of pages
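
The four strategies above differ mainly in which pages move during the initial migration and where the rest come from later. The table-driven sketch below encodes that reading; the names are hypothetical, and the later source listed for lazy copying (the sending node) is an assumption not stated in the text.

```python
# strategy -> (pages sent at migration time, where remaining pages come from later)
STRATEGIES = {
    "eager": ("all", None),
    "dirty eager": ("dirty", "secondary storage"),
    "copy-on-reference": ("dirty", "sending node or secondary storage"),
    "lazy": ("none", "sending node"),  # assumed source for on-demand pages
}


def initial_transfer(pages, strategy):
    """Return the set of pages sent during the initial migration.

    pages maps a page number to True if the page is dirty.
    """
    sent, _later_source = STRATEGIES[strategy]
    if sent == "all":
        return set(pages)
    if sent == "dirty":
        return {page for page, dirty in pages.items() if dirty}
    return set()


pages = {0: True, 1: False, 2: True, 3: False}
for name in STRATEGIES:
    print(name, sorted(initial_transfer(pages, name)))
```

The fewer pages sent up front, the lower the initial migration latency, but the more the migrated process depends on the sending node or on storage afterwards (residual dependency).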

Flushing migration

- Flush all dirty pages to disk before migration
- Transfer no memory pages during initial migration

Precopy migration

- Begin transferring memory pages before suspending process
- Migrate once dirty pages reach a lower threshold
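
Precopy migration is essentially an iterative loop: copy pages while the process keeps running, then re-copy whatever it dirtied in the meantime, until the dirty set is small enough to suspend and finish. The sketch below simulates that loop; `precopy` and the workload passed in are hypothetical.

```python
def precopy(pages, dirtied_per_round, threshold):
    """Simulate precopy migration.

    pages: all page numbers of the process.
    dirtied_per_round: for each copy round, the pages the still-running
        process dirties during that round (a simulated workload).
    Returns (rounds, final_pages, pages_sent_before_suspend), where
    final_pages are sent after the process is suspended.
    """
    all_pages = set(pages)
    to_send = set(all_pages)  # the first round copies everything
    rounds = 0
    sent = 0
    for dirtied in dirtied_per_round:
        if len(to_send) <= threshold:
            break  # few enough dirty pages: suspend and migrate now
        sent += len(to_send)
        rounds += 1
        # Pages dirtied during this round must be copied again.
        to_send = set(dirtied) & all_pages
    return rounds, to_send, sent


print(precopy(range(8), [{1, 2, 3, 4}, {2, 3}, {3}], threshold=2))
# (2, {2, 3}, 12)
```

In this run two rounds transfer 12 pages while the process runs, and only pages 2 and 3 remain to be sent while it is suspended.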

Load balancing

System attempts to distribute the processing load evenly among processors

- Reduces variance in response times
- Ensures no processors remain idle while other processors are overloaded

Two types

- Static load balancing
- Dynamic load balancing

Static load balancing

Useful in environments in which jobs exhibit predictable patterns

Goals

- Distribute the load
- Reduce communication costs

Solve using a graph

Must use approximations for large systems to balance loads with reasonable overhead
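
In the graph formulation, processes are vertices weighted by load and edges are weighted by communication cost; the goal is a partition with balanced load and a small cut. The sketch below uses one possible greedy approximation, chosen for brevity rather than quality; `greedy_partition`, `cut_cost` and the `balance_weight` parameter are hypothetical.

```python
def greedy_partition(load, comm, num_processors, balance_weight=1.0):
    """Assign processes to processors one at a time, heaviest first.

    load: process -> load. comm: (p, q) -> communication cost between p and q.
    Each process goes to the processor that minimizes resulting load plus the
    cost of edges to already-placed processes on other processors.
    """
    assignment = {}
    totals = [0.0] * num_processors
    for proc in sorted(load, key=load.get, reverse=True):
        def cost(cpu):
            cut = sum(c for (a, b), c in comm.items()
                      if (a == proc and assignment.get(b, cpu) != cpu)
                      or (b == proc and assignment.get(a, cpu) != cpu))
            return balance_weight * (totals[cpu] + load[proc]) + cut
        best = min(range(num_processors), key=cost)
        assignment[proc] = best
        totals[best] += load[proc]
    return assignment, totals


def cut_cost(assignment, comm):
    """Total communication cost between processes on different processors."""
    return sum(c for (a, b), c in comm.items() if assignment[a] != assignment[b])


load = {"A": 4, "B": 4, "C": 2, "D": 2}
comm = {("A", "C"): 5, ("B", "D"): 5, ("A", "B"): 1}
assignment, totals = greedy_partition(load, comm, num_processors=2)
print(assignment, totals, cut_cost(assignment, comm))
```

For this small example the heavy communicators (A with C, B with D) end up together and only the cheap A-B edge crosses processors.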

Dynamic load balancing

More useful than static load balancing in environments where communication patterns change and processes are created or terminated unpredictably

Types of policies

- Sender-initiated
- Receiver-initiated
- Symmetric
- Random

Algorithms

- Bidding algorithm
- Drafting algorithm

Communication issues

- Communication with remote nodes can have long delays; load on a processor might change in this time
- Coordination between processors to balance load creates many messages
- Can overload the system

Solution

- Only communicate with neighbor nodes
- Load diffuses throughout the system
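
The neighbor-only solution can be sketched as a diffusion step in which each node hands a fraction of its load surplus to each less-loaded neighbor. The model below is an assumption-laden illustration: load is treated as a divisible quantity, neighbor lists are symmetric, and `rate` must be small enough (for example at most 1 / (node degree + 1)) for the process to settle; `diffuse` is a hypothetical name.

```python
def diffuse(loads, neighbors, rate=0.25, rounds=1):
    """Return the loads after some rounds of neighbor-to-neighbor diffusion.

    loads: node -> load. neighbors: node -> list of adjacent nodes.
    """
    loads = dict(loads)
    for _ in range(rounds):
        delta = {node: 0.0 for node in loads}
        for node, adjacent in neighbors.items():
            for other in adjacent:
                if loads[node] > loads[other]:
                    # Only the more loaded side of a link sends work across it.
                    moved = rate * (loads[node] - loads[other])
                    delta[node] -= moved
                    delta[other] += moved
        for node in loads:
            loads[node] += delta[node]
    return loads


ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
start = {0: 8.0, 1: 0.0, 2: 0.0, 3: 0.0}
print(diffuse(start, ring, rounds=1))  # {0: 4.0, 1: 2.0, 2: 0.0, 3: 2.0}
print(diffuse(start, ring, rounds=2))  # {0: 3.0, 1: 2.0, 2: 1.0, 3: 2.0}
```

Node 2 never talks to node 0 directly, yet it receives load after two rounds: the imbalance spreads through the system using only local messages.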

Summary

Next time: networks and distributed systems
