pa adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •c oarse-grain −p...

30
Advanced Topics Chapter 17 Parallelism CS250 -- Part V 1 Dr. Rajesh Subramanyan, 2005

Upload: others

Post on 22-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Advanced Topics

Chapter 17

Parallelism

CS250 -- Part V

1D

r.Rajesh Subram

anyan, 2005

Page 2: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Topics

•Introduction

•C

haracterizations of Parallelism

•M

icroscopic Vs M

acroscopic

•Sym

metric V

s. Asym

metric

•Fine G

rain Vs C

oarse-grain

•E

xplicit Vs. Im

plicit

•Parallel A

rchitectures

•SISD

•SIM

D

CS250 -- Part V

2D

r.Rajesh Subram

anyan, 2005

Page 3: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Topics

•M

IMD

•Sym

metric M

ultiprocessors

•C

omm

unication, Coordination, C

ontention

•Perform

ance of Multiprocessors

•C

onsequence for Programm

ers

•L

ocks and Mutual E

xclusion

•Program

ming E

xplicit and Implicit Parallel C

omputers

•R

edundant Parallel Architectures

•D

istributed and Cluster C

omputers

CS250 -- Part V

3D

r.Rajesh Subram

anyan, 2005

Page 4: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Introduction

•Previous parts covered

−processors

−m

emory

−I/O

•N

ow

−advanced topics

−use of parallelism

to increase speed

−parallel hardw

are, taxonomy,lim

itations etc.

•Parallel and pipelined architectures

−tw

o fundamental techniques or basis of increasing speed

CS250 -- Part V

4D

r.Rajesh Subram

anyan, 2005

Page 5: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Characterizations of P

arallelism

•Parallel and non-parallel are the extrem

es

•T

he greyregion in-betw

een describes amount and type of

parallelism

−m

icroscopic vs macroscopic

−sym

metric vs. asym

metric

−fine grain vs coarse-grain

−explicit vs. im

plicit

CS250 -- Part V

5D

r.Rajesh Subram

anyan, 2005

Page 6: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Microscopic V

s Macroscopic

•M

icroscopic

−parallelism

in a computer rem

ains hidden insidesubcom

ponents

−parallel hardw

are within specific com

ponent

•M

acroscopic

−parallelism

is a basic premise around w

hich the system is

designed

CS250 -- Part V

6D

r.Rajesh Subram

anyan, 2005

Page 7: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Exam

ples

•M

icroscopic

−A

LU

, registers, physical mem

ory,parallel busarchitecture.

•M

acroscopic

−use parallelism

across multiple large-scale com

ponents ofa

computer system

s, multiple identical processors,

multiple dissim

ilar processors

CS250 -- Part V

7D

r.Rajesh Subram

anyan, 2005

Page 8: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Symm

etric Vs. A

symm

etric

•Sym

metric

−replication of identical elem

ents

−e.g. dual processors

•A

symm

etric

−m

ultiple elements that function at the sam

e time but differ

from each other

−PC

with graphics coprocessor and m

ath coprocessor

CS250 -- Part V

8D

r.Rajesh Subram

anyan, 2005

Page 9: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Fine G

rain Vs C

oarse-grain

•Fine G

rain

−parallelism

at the levelofindividual instructions or

individual data elements

−e.g. graphics processor that uses 16 hardw

are units toupdate 16 bytes of an im

age at the same tim

e

•C

oarse-grain

−parallelism

on the levelofprogram

s or large blocks ofdata

−e.g. dual processor,using a processor each to print andcom

pose an email sim

ultaneously

CS250 -- Part V

9D

r.Rajesh Subram

anyan, 2005

Page 10: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Explicit V

s. Implicit

•Im

plicit

−hardw

are handles parallelism autom

atically without

requiring a programm

er to initiate or control parallelexecution

•E

xplicit

−program

mer m

ust control each parallel unit

CS250 -- Part V

10D

r.Rajesh Subram

anyan, 2005

Page 11: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Parallel A

rchitectures

•Parallel architectures: parallelism

of the central featuresaround w

hich the entire system is designed

•Flynn’s

classification

Nam

e M

eanin

gS

ISD

S

ing

leIn

structio

n S

ing

le Data stream

SIM

D

Sin

gle

Instru

ction

Mu

ltiple D

ata streams

MIM

D

Mu

ltiple

Instru

ction

s Mu

ltiple D

ata streams

Terminology used to characterize com

puters according to theam

ount and type of parallelism.

Note, hybrids that span

multiple types also exist.

CS250 -- Part V

11D

r.Rajesh Subram

anyan, 2005

Page 12: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

SISD

•Single Instruction Single D

ata stream

•A

lso called sequential or uniprocessor architecture

•Follow

s Von N

eumann’s

architecture

•D

oes not support macroscopic parallelism

, can useparallelism

internally

CS250 -- Part V

12D

r.Rajesh Subram

anyan, 2005

Page 13: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

SIMD

•Single Instruction stream

s Multiple D

ata streams

•E

ach instruction specifies a single operation applied to many

data items sim

ultaneously

•V

ector,array processors

for i from 1 to N

{

V[i]

←V

[i]×

Q;

}

Algorithm

for vector normalization in a sequential com

puter.

•Sam

e in a SIMD

computer

V<

-V

XQ

•G

raphic processors

CS250 -- Part V

13D

r.Rajesh Subram

anyan, 2005

Page 14: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

MIM

D

•M

ultiple Instruction streams M

ultiple Data stream

s

•E

ach processor performs independent com

putations at thesam

e time

•E

xample of M

IMD

−sym

metric m

ultiprocessors (SMP)

−asym

metric m

ultiprocessors (AM

P)

−m

ath graphics coprocessors

−I/O

processors

−C

DC

peripheral processors

CS250 -- Part V

14D

r.Rajesh Subram

anyan, 2005

Page 15: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Symm

etric Multiprocessors

Main

Mem

or y

(variou

sm

od

ules)

Devices

P1Pi

P2

Pi+1

PN

Pi+2

Symm

etric multiprocessor w

ith N identical processors. E

achprocessor has access to m

emory I/O

devices.

CS250 -- Part V

15D

r.Rajesh Subram

anyan, 2005

Page 16: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Perform

ance of Multiprocessors

•Speedup =

execution time on a single processor/execution

time required on a m

ultiprocessor

CP

U

I/O d

evices

PP

1

PP

2

PP

3P

P4

PP

5

PP

6

PP

7P

P8

PP

9

PP

10

Illustration of the ideal and typical performance of a

multiprocessor as the num

ber of processors is increased.V

alues on the y-axis list the relative speedup compared to a

single processor.

CS250 -- Part V

16D

r.Rajesh Subram

anyan, 2005

Page 17: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Perform

ance of Multiprocessors

•R

easons why

multiprocessor architectures have not fulfilled

the promise of scalable, high perform

ance computing.

−O

Sbottlenecks

−m

emory contention

−I/O

bottlenecks

When used for general-purpose com

puting,am

ultiprocessor does not perform w

ell. In some cases,

added overhead means perform

ance decreases as more

processorsare added.

CS250 -- Part V

17D

r.Rajesh Subram

anyan, 2005

Page 18: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Consequence for P

rogramm

ers

•L

ocks and mutual exclusion

•Program

ming explicit and im

plicit parallel computers

•Program

ming sym

metric and asym

metric m

ultiprocessors

CS250 -- Part V

18D

r.Rajesh Subram

anyan, 2005

Page 19: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Locks and M

utual Exclusion

•Shared variables in m

ultiprocessors face the following

problem.

−W

hat happens if two

processors attempt to increm

ent atthe sam

e time?

−V

ariable is incremented only once instead of the correct

two

times.

•O

ne solution: locks

CS250 -- Part V

19D

r.Rajesh Subram

anyan, 2005

Page 20: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Locks and M

utual Exclusion

loadx, R5

incrR5

storeR5, x

An exam

ple sequence of machine instructions used to increm

enta

variable in mem

ory.Inm

ost architectures, increment entails a

load and store.

CS250 -- Part V

20D

r.Rajesh Subram

anyan, 2005

Page 21: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Locks

lock17loadx, R

5incrR

5storeR

5, xrelease 17

Illustration of the instructions used to guarantee exclusive accessto a variable. A

separate lock is assigned to each shared item.

•O

nly one copyof

lock exists, only one processor can gainaccess to lock at a tim

e.

CS250 -- Part V

21D

r.Rajesh Subram

anyan, 2005

Page 22: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Locks

•Problem

s

−program

mers forget to apply locks to shared variables

−errors due to sim

ultaneous access depend on timing and

are rare to detect

−locking can reduce perform

ance

−locking adds overhead

CS250 -- Part V

22D

r.Rajesh Subram

anyan, 2005

Page 23: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Program

ming E

xplicit and Implicit P

arallel Com

puters

Fr om

apro gram

mer’s

point of view, a system

that uses explicitparallelism

is significantly more

complex

topro gram

that asystem

that uses implicit parallelism

.

CS250 -- Part V

23D

r.Rajesh Subram

anyan, 2005

Page 24: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Redundant P

arallel Architectures

•U

se of parallel hardware to im

prove reliability and preventfailure

•M

ultiple copies of a hardware unit that operate in parallel to

perform an operation (sam

e data and same operation on all

units)

•W

hat happens if redundant copies of hardware disagree?

−votes?

−error m

essage.

CS250 -- Part V

24D

r.Rajesh Subram

anyan, 2005

Page 25: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Distributed and C

luster Com

puters

•T

ightly coupled

−parallel hardw

are units are located inside the same

computer system

•L

oosely coupled

−m

ultiple computer system

s that are interconnected by acom

munication m

echanism

−e.g. distributed architectures, cluster com

puter

CS250 -- Part V

25D

r.Rajesh Subram

anyan, 2005

Page 26: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Distributed and C

luster Com

puters

•D

istributed architecture

−set of com

puters connected by a computer netw

ork orInternet

•C

luster computer

−set of independent com

puters connected by a high-speedcom

puter network that are all dedicated to solving one

problem at a tim

e

−also called netw

ork cluster

CS250 -- Part V

26D

r.Rajesh Subram

anyan, 2005

Page 27: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Summ

ary

•Parallelism

is one of the fundamental optim

ization techniquesused to increase hardw

are performance

•E

xplicit parallelism gives

aprogram

mer control over

the useof parallel facilities; im

plicit parallelism handles parallelism

automatically

•SIM

D architecture allow

s an instruction to operate on anarray of values

−vector processors

−graphic processors

CS250 -- Part V

27D

r.Rajesh Subram

anyan, 2005

Page 28: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

Summ

ary

•M

IMD

employs m

ultiple, independent processors thatoperate sim

ultaneously and can each execute a separateprogram

.

−sym

metric m

ultiprocessors

−asym

metric m

ultiprocessors

•Speedup

−theoretically m

achine with N

processors should performN

times as fast as a single processor

−in

practice, speedup does not increase linearly with

number of processors due to m

emory contention and

comm

unication overhead.

•A

lternativesto

SIMD

and MIM

D are redundant, distributed,

and cluster architecturesC

S250 -- Part V28

Dr.R

ajesh Subramanyan, 2005

Page 29: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

CS250 -- Part V

29D

r.Rajesh Subram

anyan, 2005

Page 30: Pa Adv - eca.cs.purdue.edu · update 16 bytes of an image at the same time •C oarse-grain −p arallelism on the le ve lo fp rograms or lar ge blocks of data −e.g. dual processor,u

CS250 -- Part V

29D

r.Rajesh Subram

anyan, 2005