PARALLEL PROCESSOR ORGANIZATIONS

Jehan-François Pâris
[email protected]
Chapter Organization
• Overview
• Writing parallel programs
• Multiprocessor organizations
• Hardware multithreading
• Alphabet soup (SISD, SIMD, MIMD, …)
• Roofline performance model
OVERVIEW
The hardware side
• Many parallel processing solutions
  – Multiprocessor architectures
    • Two or more microprocessor chips
    • Multiple architectures
  – Multicore architectures
    • Several processors on a single chip
The software side
• Two ways for software to exploit the parallel processing capabilities of hardware
  – Job-level parallelism
    • Several sequential processes run in parallel
    • Easy to implement (the OS does the job!)
  – Process-level parallelism
    • A single program runs on several processors at the same time
WRITING PARALLEL PROGRAMS
Overview
• Some problems are embarrassingly parallel
  – Many computer graphics tasks
  – Brute-force searches in cryptography or password guessing
• Parallelization is much more difficult for other applications
  – Communication overhead among sub-tasks
  – Amdahl's law
  – Balancing the load
Amdahl's Law
• Assume a sequential process takes
  – tp seconds to perform operations that could be performed in parallel
  – ts seconds to perform purely sequential operations
• The maximum speedup will be

  (tp + ts)/ts
Balancing the load
• Must ensure that the workload is equally divided among all the processors
• The worst case is when one processor does much more work than all the others
Example (I)
• Computation partitioned among n processors
• One of them does 1/m of the work, with m < n
  – That processor becomes a bottleneck
• Maximum expected speedup: n
• Actual maximum speedup: m
Example (II)
• Computation partitioned among 64 processors
• One of them does 1/8 of the work
• Maximum expected speedup: 64
• Actual maximum speedup: 8
A last issue
• Humans like to address issues one after the other
  – We have meeting agendas
  – We do not like to be interrupted
  – We write sequential programs
René Descartes

• Seventeenth-century French philosopher
• Invented
  – Cartesian coordinates
  – Methodical doubt
    • [To] never accept anything for true which I did not clearly know to be such
• Proposed a scientific method based on four precepts
Method's third rule
• The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.
MULTIPROCESSOR ORGANIZATIONS
Shared memory multiprocessors
[Diagram: several processing units (PU), each with its own cache, connected through an interconnection network to shared RAM and I/O]
Shared memory multiprocessor
• Can offer
  – Uniform memory access to all processors (UMA)
    • Easiest to program
  – Non-uniform memory access to all processors (NUMA)
    • Can scale up to larger sizes
    • Offers faster access to nearby memory
Computer clusters
[Diagram: several computers, each with its own PU, cache, and RAM, connected through an interconnection network]
Computer clusters
• Very easy to assemble
• Can take advantage of high-speed LANs
  – Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing
Message passing (I)
• If processor P wants to access data in the main memory of processor Q, it must
  – Send a request to Q
  – Wait for a reply
• For this to work, processor Q must have a thread
  – Waiting for messages from other processors
  – Sending them replies
Message passing (II)
• In a shared memory architecture, each processor can directly access all data
• A proposed solution
  – Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
  – Still has performance issues
When things do not add up
• Memory capacity is very important for big computing applications
  – If the data can fit into main memory, the computation will run much faster
A problem
• A company replaced
  – A single shared-memory computer with 32 GB of RAM
  – With four "clustered" computers with 8 GB each
• The result was more I/O than ever
• What happened?
The explanation
• Assume the OS occupies one GB of RAM
  – The old shared-memory computer still had 31 GB of free RAM
  – Each of the clustered computers has only 7 GB of free RAM
• The total RAM available to the program went down from 31 GB to 4×7 = 28 GB!
Grid computing
• The computers are distributed over a very large network
  – Sometimes computer time is donated
    • Volunteer computing
    • SETI@home
  – Works well with embarrassingly parallel workloads
    • Searches in an n-dimensional space
HARDWARE MULTITHREADING
General idea
• Let the processor switch to another thread of computation while the current one is stalled
• Motivation:
  – The increased cost of cache misses
Implementation
• Entirely controlled by the hardware
  – Unlike multiprogramming
• Requires a processor capable of
  – Keeping track of the state of each thread
    • One set of registers (including the PC) for each concurrent thread
  – Quickly switching among concurrent threads
Approaches
• Fine-grained multithreading:
  – Switches between threads at each instruction
  – Provides the highest throughput
  – Slows down the execution of individual threads
Approaches
• Coarse-grained multithreading:
  – Switches between threads whenever a long stall is detected
  – Easier to implement
  – Cannot eliminate all stalls
Approaches
• Simultaneous multithreading:
  – Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel
  – The best solution
ALPHABET SOUP
Overview
• Used to describe processor organizations where
  – The same instructions can be applied to
  – Multiple data instances
• Encountered in
  – Vector processors in the past
  – Graphics processing units (GPUs)
  – The x86 multimedia extensions
Classification
• SISD:
  – Single instruction, single data
  – Conventional uniprocessor architecture
• MIMD:
  – Multiple instructions, multiple data
  – Conventional multiprocessor architecture
Classification
• SIMD:
  – Single instruction, multiple data
  – Performs the same operations on a set of similar data
• Think of adding two vectors:

  for (i = 0; i < VECSIZE; i++)
      sum[i] = a[i] + b[i];
Vector computing
• A kind of SIMD architecture
  – Used by Cray computers
• Pipelines multiple executions of a single instruction with different data ("vectors") through the ALU
• Requires
  – Vector registers able to store multiple values
  – Special vector instructions: say lv, addv, …
Benchmarking
• Two factors to consider
  – Memory bandwidth
    • Depends on the interconnection network
  – Floating-point performance
• The best-known benchmark is LINPACK
Roofline model
• Takes into account
  – Memory bandwidth
  – Floating-point performance
• Introduces arithmetic intensity
  – The total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  – Measured in FLOPs/byte
Roofline model
• Attainable GFLOPs/s =
  min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
Roofline model
[Figure: the roofline plot. Attainable performance rises with arithmetic intensity along the memory-bandwidth slope, then flattens at the peak floating-point performance roof; to the left of the ridge point, floating-point performance is limited by memory bandwidth]