Understanding Parallelism and the Limitations of Parallel Computing


Page 1: Understanding Parallelism and the Limitations of Parallel Computing

(c) RRZE 2013

Page 2: Understanding Parallelism: Sequential work

After 16 time steps: 4 cars (the cars are assembled one after another)


Page 3: Understanding Parallelism: Parallel work

After 4 time steps: 4 cars (all four are assembled concurrently), a “perfect speedup” of 4


Page 4: Understanding Parallelism: Shared resources, imbalance

[Figure: workers competing for a shared resource; time is lost waiting for the shared resource and waiting for synchronization. Unused resources due to resource bottleneck and imbalance!]


Page 5: Limitations of Parallel Computing: Amdahl's Law


Ideal world: all work is perfectly parallelizable.

Closer to reality: purely serial parts limit the maximum speedup.

Reality is even worse: communication and synchronization impede scalability even further.


Page 6: Limitations of Parallel Computing: Calculating Speedup in a Simple Model (“strong scaling”)

T(1) = s + p = serial compute time (= 1), with purely serial part s and parallelizable part p = 1 - s

Parallel execution time: T(N) = s + p/N

General formula for speedup (Amdahl's Law, 1967, “strong scaling”):

\[ S_p(N) = \frac{T(1)}{T(N)} = \frac{1}{s + \frac{1-s}{N}} \]
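As a minimal sketch of this formula in Python (the function name amdahl_speedup and the sample values of s and N are my own, not from the slides):

def amdahl_speedup(s, N):
    # strong-scaling speedup for serial fraction s on N workers
    return 1.0 / (s + (1.0 - s) / N)

for N in (1, 2, 4, 16, 1024):
    print(N, round(amdahl_speedup(0.1, N), 2))   # saturates well below N for s = 0.1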

Page 7: Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

Reality: No task is perfectly parallelizable

Shared resources have to be used serially

Task interdependencies must be accounted for

Communication overhead (but that can be modeled separately)

Benefit of parallelization may be strongly limited

"Side effect": limited scalability leads to inefficient use of resources

Metric: Parallel Efficiency (what percentage of the workers/processors is efficiently used):

\[ \varepsilon = \frac{S_p(N)}{N} \]

Amdahl case:

\[ \varepsilon = \frac{1}{s\,(N-1) + 1} \]

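A small numerical illustration of this metric (the naming and the sample values are mine):

def parallel_efficiency(s, N):
    # efficiency = speedup / number of workers = 1 / (s*(N-1) + 1)
    speedup = 1.0 / (s + (1.0 - s) / N)
    return speedup / N

print(round(parallel_efficiency(0.1, 100), 3))   # ~0.092: only ~9% of the 100 workers are used efficiently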

Page 8: Limitations of Parallel Computing: Adding a simple communication model for strong scaling

T(1) = s + p = serial compute time, with purely serial part s, parallelizable part p = 1 - s, and a fraction k for communication per worker

Parallel execution time: T(N) = s + p/N + Nk

General formula for speedup:

\[ S_p^k(N) = \frac{T(1)}{T(N)} = \frac{1}{s + \frac{1-s}{N} + Nk} \]

Model assumption: non-overlapping communication of messages
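A minimal sketch of the extended model (naming and sample values are my own):

def speedup_with_comm(s, N, k):
    # strong-scaling speedup with serial fraction s and per-worker communication fraction k
    return 1.0 / (s + (1.0 - s) / N + N * k)

for N in (2, 4, 8, 16):
    print(N, round(speedup_with_comm(0.1, N, 0.05), 2))   # rises at first, then the N*k term wins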

Page 9: Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

Large N limits:

At k = 0, Amdahl's Law predicts

\[ \lim_{N \to \infty} S_p^{k=0}(N) = \frac{1}{s} \]

independent of N!

At k ≠ 0, our simple model of communication overhead yields a behaviour of

\[ S_p^k(N) \approx \frac{1}{Nk} \quad \text{for large } N \]

Problems in real world programming

Load imbalance

Shared resources have to be used serially (e.g. IO)

Task interdependencies must be accounted for

Communication overhead


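A quick numerical check of the two regimes (parameter values are only illustrative):

def speedup(s, N, k=0.0):
    return 1.0 / (s + (1.0 - s) / N + N * k)

for N in (10, 100, 1000, 10000):
    print(N, round(speedup(0.1, N), 2), round(speedup(0.1, N, k=0.05), 3))
# with k = 0 the speedup saturates near 1/s = 10
# with k > 0 the N*k term dominates for large N and the speedup decays like 1/(N*k)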

Page 10: Limitations of Parallel Computing: Amdahl's Law (“strong scaling”) + comm. model

[Plot: speedup S(N) vs. # CPUs (1 to 10) for s = 0.01, s = 0.1, and s = 0.1 with k = 0.05]


Page 11: Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

[Plot: speedup S(N) vs. # CPUs (1 to 1000) for s = 0.01, s = 0.1, and s = 0.1 with k = 0.05; the annotated parallel efficiencies at large CPU counts range from roughly 50% down to below 10%]

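A short matplotlib sketch that reproduces curves of this kind (the parameter values follow the plots above; everything else is my own):

import numpy as np
import matplotlib.pyplot as plt

def speedup(s, N, k=0.0):
    # strong-scaling speedup with serial fraction s and communication fraction k
    return 1.0 / (s + (1.0 - s) / N + N * k)

N = np.arange(1, 1001)
for s, k, label in [(0.01, 0.0, "s=0.01"), (0.1, 0.0, "s=0.1"), (0.1, 0.05, "s=0.1, k=0.05")]:
    plt.plot(N, speedup(s, N, k), label=label)
plt.xscale("log")
plt.xlabel("# CPUs")
plt.ylabel("S(N)")
plt.legend()
plt.show()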

Page 12: Limitations of Parallel Computing: How to mitigate overheads

Communication is not necessarily purely serial:

Non-blocking crossbar networks can transfer many messages concurrently, so the factor Nk in the denominator becomes k (a technical measure).

Sometimes communication can be overlapped with useful work (a matter of implementation and algorithm); the communication overhead may then show a more fortunate behaviour than Nk.

“Superlinear speedups”: the data size per CPU decreases with increasing CPU count and may fit into cache at large CPU counts.

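As a sketch of the first point above, the communication term can be made independent of N (the function names are mine; the constant-k form is the one described above):

def speedup_serialized_comm(s, N, k):
    # messages are sent one after another: communication cost grows as N*k
    return 1.0 / (s + (1.0 - s) / N + N * k)

def speedup_concurrent_comm(s, N, k):
    # messages are transferred concurrently: communication cost stays at k
    return 1.0 / (s + (1.0 - s) / N + k)

print(round(speedup_serialized_comm(0.1, 100, 0.05), 2),
      round(speedup_concurrent_comm(0.1, 100, 0.05), 2))
# 0.2 vs. 6.29: removing the growth with N restores most of the Amdahl-limited speedup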

Page 13: Limits of Scalability: Serial & Parallel fraction

Serial fraction s may depend on

Program / algorithm

Non-parallelizable part, e.g. recursive data setup

Non-parallelizable IO, e.g. reading input data

Communication structure

Load balancing (assumed so far: perfectly balanced)

Computer hardware

Processor: Cache effects & memory bandwidth effects

Parallel Library; Network capabilities; Parallel IO

Determine s “experimentally”: measure the speedup and fit the data to Amdahl's law. But that can be more complicated than it seems…
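A minimal sketch of such a fit, assuming some measured speedups are available (the data points below are invented purely for illustration):

import numpy as np
from scipy.optimize import curve_fit

def amdahl(N, s):
    # Amdahl speedup model with the serial fraction s as the only free parameter
    return 1.0 / (s + (1.0 - s) / N)

# hypothetical measured speedups for N = 1, 2, 4, 8, 16 workers
N = np.array([1, 2, 4, 8, 16])
S_measured = np.array([1.0, 1.9, 3.4, 5.7, 8.6])

(s_fit,), _ = curve_fit(amdahl, N, S_measured, p0=[0.05])
print(f"fitted serial fraction s = {s_fit:.3f}")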

Page 14: Scalability data on modern multi-core systems

An example:

12 cores on socket

12 sockets on node

Scaling across nodes

[Diagram: example multi-socket system; each socket contains several processors (P) with caches (C) and is attached via a chipset to its local memory]

Page 15: Scalability data on modern multi-core systems: The scaling baseline

Scalability presentations should be grouped according to the largest unit that the scaling is based on (the “scaling baseline”).


Amdahl model with communication, fitted to inter-node scalability numbers (N = # nodes, > 1):

\[ S(N) = \frac{1}{s + \frac{1-s}{N} + kN} \]

[Plot: speedup (0.0 to 2.0) vs. # CPUs (1, 2, 4), showing intranode and internode results; memory-bound code! Good scaling across sockets.]
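A hedged sketch of such a fit over node counts (the node counts and speedups below are invented placeholders, normalized to a one-node baseline):

import numpy as np
from scipy.optimize import curve_fit

def amdahl_comm(N, s, k):
    # Amdahl model with a communication term that grows with the node count N
    return 1.0 / (s + (1.0 - s) / N + k * N)

# hypothetical inter-node speedups relative to one node (N > 1 only)
nodes = np.array([2, 4, 8, 16])
speedups = np.array([1.9, 3.3, 5.0, 5.6])

(s_fit, k_fit), _ = curve_fit(amdahl_comm, nodes, speedups, p0=[0.05, 0.01])
print(f"s = {s_fit:.3f}, k = {k_fit:.4f}")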

Page 16: Application to “accelerated computing”

SIMD, GPUs, Cell SPEs, FPGAs, just any optimization…

Assume the overall (serial, un-accelerated) runtime to be T_s = s + p = 1.

Assume p can be accelerated and run α times faster. We neglect any additional cost (communication…).

To get a speedup of rα, how small must s be? Solve for s:

\[ \frac{1}{s + \frac{1-s}{\alpha}} = r\alpha \;\Longrightarrow\; s = \frac{1/r - 1}{\alpha - 1} = \frac{1-r}{r\,(\alpha - 1)} \]

At α = 10 and r = 0.9 (for an overall speedup of 9), we get s ≈ 0.012, i.e. you must accelerate 98.8% of the serial runtime!

Limited memory on accelerators may limit the achievable speedup.
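A quick numerical check of this relation (function and parameter names are my own):

def required_serial_fraction(alpha, r):
    # largest serial fraction s that still permits an overall speedup of r*alpha
    # when the parallelizable part runs alpha times faster
    return (1.0 / r - 1.0) / (alpha - 1.0)

s = required_serial_fraction(alpha=10, r=0.9)
print(f"s = {s:.4f}, i.e. {100 * (1 - s):.1f}% of the runtime must be accelerated")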