
EPFL CS-206 – Spring 2015 Lec.3 - 1

CS-206 Concurrency
Lecture 3: Performance & Efficiency
Spring 2015
Prof. Babak Falsafi
parsa.epfl.ch/courses/cs206/

Adapted from slides originally developed by Maurice Herlihy and Nir Shavit from The Art of Multiprocessor Programming, and Babak Falsafi. EPFL Copyright 2015.


EPFL CS-206 – Spring 2015 Lec.3 - 2

Lecture & Lab

M       T       W       T       F
16-Feb  17-Feb  18-Feb  19-Feb  20-Feb
23-Feb  24-Feb  25-Feb  26-Feb  27-Feb
2-Mar   3-Mar   4-Mar   5-Mar   6-Mar
9-Mar   10-Mar  11-Mar  12-Mar  13-Mar
16-Mar  17-Mar  18-Mar  19-Mar  20-Mar
23-Mar  24-Mar  25-Mar  26-Mar  27-Mar
30-Mar  31-Mar  1-Apr   2-Apr   3-Apr
6-Apr   7-Apr   8-Apr   9-Apr   10-Apr
13-Apr  14-Apr  15-Apr  16-Apr  17-Apr
20-Apr  21-Apr  22-Apr  23-Apr  24-Apr
27-Apr  28-Apr  29-Apr  30-Apr  1-May
4-May   5-May   6-May   7-May   8-May
11-May  12-May  13-May  14-May  15-May
18-May  19-May  20-May  21-May  22-May
25-May  26-May  27-May  28-May  29-May

Where are We?

- Parallelism
  - Parallel thinking
  - Division of work
  - Performance

- Next week
  - Mutual Exclusion

EPFL CS-206 – Spring 2015 Lec.3 - 3

To design well-balanced parallel software we need to think about how a problem can be solved in parallel, divide the work evenly among threads, maximize the parallelism, and reduce overhead.

EPFL CS-206 – Spring 2015 Lec.3 - 4

Example Problem: Galactic Motion

- Computing the mutual interactions of N bodies
  - n-body problems
  - stars, planets, molecules…
  - modeled as a tree

- Can approximate the influence of distant bodies
- How do we compute this problem in parallel?

EPFL CS-206 – Spring 2015 Lec.3 - 5

Example Problem: Ocean Simulation

- Simulate ocean eddy currents
- Discretize in space and time
  - Modeled as grids of elements with velocity
  - Compute the impact of near-neighbor elements
  - Iterate until a solution is reached

- Used in weather prediction, climate science, …

[Figure: discretized grid of an ocean basin]

EPFL CS-206 – Spring 2015 Lec.3 - 6

Example Problem: Deep Learning

Used in search, machine translation, face recognition, investment banking, …

EPFL CS-206 – Spring 2015 Lec.3 - 7

Technical distinction: Concurrency vs. Parallelism

- Concurrency: two or more threads make forward progress together

- Parallelism: two or more threads execute at the same time
  - All parallel threads are concurrent, but not vice versa

- Roughly: how many threads vs. how many cores

[Figure: on one core, Thread 1 and Thread 2 take turns (concurrency); on two cores, Thread 1 and Thread 2 run at the same time (parallelism).]

EPFL CS-206 – Spring 2015 Lec.3 - 8

Terminology

- A Task is a piece of work
  - Ocean simulation: compute a grid point, row, or plane
  - Raytracing: one ray or a group of rays

- Task grain
  - small → fewer operations (less work) per task
  - large → more operations (more work) per task

- Threads perform tasks
  - Threads execute on processors

EPFL CS-206 – Spring 2015 Lec.3 - 9

Forms of Parallelism

- Throughput parallelism
  - Perform many (identical) sequential tasks at the same time
  - E.g., Google search, ATM (bank) transactions

- Functional or task parallelism
  - Perform tasks that are functionally different in parallel
  - E.g., iPhoto (face recognition with slide show)

- Pipeline parallelism
  - Perform tasks that are different in a particular order
  - E.g., speech (signal, phonemes, words, conversation)

- Data parallelism
  - Perform the same task on different data
  - E.g., data analytics, image processing (see the sketch after this list)

} Reduce time for one job
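Not from the slides: a minimal Java sketch of data parallelism, applying one task to many data elements and letting the runtime split the work across cores (the pixel buffer and brighten() are illustrative):

import java.util.stream.IntStream;

public class DataParallel {
  public static void main(String[] args) {
    int[] pixels = new int[1_000_000]; // e.g., an image buffer

    // Data parallelism: the same per-element task runs on different data,
    // partitioned across cores by the parallel stream runtime.
    IntStream.range(0, pixels.length)
             .parallel()
             .forEach(i -> pixels[i] = brighten(pixels[i]));
  }

  // Illustrative per-element task.
  static int brighten(int pixel) {
    return Math.min(255, pixel + 16);
  }
}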

EPFL CS-206 – Spring 2015 Lec.3 - 10

Division of Work: It’s about Performance

- Balance workload
  - Give each parallel task roughly the same amount of work

- Reduce communication
  - Balance computation time with communication time
  - Computation → useful work; communication → overhead

- Reduce extra work
  - Creating a thread, assigning work
  - Scheduling threads on processors, OS, etc.

- These goals are at odds with each other

EPFL CS-206 – Spring 2015 Lec.3 - 11

Example: Division of Work

[Figure: the same work divided two ways. With small tasks, the braces mark per-task overhead; with large tasks, they mark load imbalance.]

EPFL CS-206 – Spring 2015 Lec.3 - 12

Matrix Multiplication

$$C = A \cdot B$$

EPFL CS-206 – Spring 2015 Lec.3 - 13

Matrix Multiplication

$$
\begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1n} \\
c_{21} & c_{22} & \cdots & c_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
c_{n1} & c_{n2} & \cdots & c_{nn}
\end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\times
\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1n} \\
b_{21} & b_{22} & \cdots & b_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
b_{n1} & b_{n2} & \cdots & b_{nn}
\end{pmatrix}
$$

where $c_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj}$, i.e., $C_{n\times n} = A_{n\times n} \times B_{n\times n}$; a sequential baseline sketch follows.
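For reference (not shown on the slides), this definition translates directly into a sequential triple loop; a sketch in the slides' own Java style, with A, B, C as n×n double arrays:

// Sequential baseline: computes C = A x B straight from the definition.
static void multiplySequential(double[][] A, double[][] B, double[][] C, int n) {
  for (int row = 0; row < n; row++) {
    for (int col = 0; col < n; col++) {
      double dotProduct = 0.0; // c[row][col] = sum over k of a[row][k]*b[k][col]
      for (int k = 0; k < n; k++)
        dotProduct += A[row][k] * B[k][col];
      C[row][col] = dotProduct;
    }
  }
}

The parallel versions that follow carve up exactly these loop iterations among threads.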

EPFL CS-206 – Spring 2015 Lec.3 - 14

Matrix Multiplication

(Same matrix equation as above.)

For each $i$ and $j$: multiply the entries $a_{ik}$ by the entries $b_{kj}$ for $k = 1, 2, \ldots, n$ and sum the results over $k$:

$$C_{n\times n} = A_{n\times n} \times B_{n\times n}, \qquad c_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj}$$

EPFL CS-206 – Spring 2015 Lec.3 - 15

Matrix Multiplication

class Worker extends Thread {
  int row, col;
  Worker(int row, int col) {
    this.row = row;
    this.col = col;
  }
  public void run() {
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += A[row][i] * B[i][col];
    C[row][col] = dotProduct;
  }
}

[Figure: row `row` of A times column `col` of B produces entry C[row][col]; all three matrices are n × n.]

EPFL CS-206 – Spring 2015 Lec.3 - 16

Matrix Multiplication

(Worker code as above.)

Annotation: a thread (the Worker class defines one thread per matrix entry)

EPFL CS-206 – Spring 2015 Lec.3 - 17

Matrix Multiplication

(Worker code as above.)

Annotation: which matrix entry to compute (the row, col fields set by the constructor)

EPFL CS-206 – Spring 2015 Lec.3 - 18

Matrix Multiplication

(Worker code as above.)

Annotation: actual computation (the dot-product loop in run())

EPFL CS-206 – Spring 2015 Lec.3 - 19

Matrix Multiplication

void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row = 0; row < n; row++)
    for (int col = 0; col < n; col++)
      worker[row][col] = new Worker(row, col);
  for (int row = 0; row < n; row++)
    for (int col = 0; col < n; col++)
      worker[row][col].start();
  for (int row = 0; row < n; row++)
    for (int col = 0; col < n; col++)
      worker[row][col].join(); // join() declares InterruptedException; handling elided
}

EPFL CS-206 – Spring 2015 Lec.3 - 20

Matrix Multiplication

(multiply() code as above.)

Annotation: the first loop nest creates n × n threads

EPFL CS-206 – Spring 2015 Lec.3 - 21

Matrix Multiplication

(multiply() code as above.)

Annotation: the second loop nest starts them

EPFL CS-206 – Spring 2015 Lec.3 - 22

Matrix Multiplication

(multiply() code as above.)

Annotations: the second loop nest starts the threads; the third waits for them to finish

EPFL CS-206 – Spring 2015 Lec.3 - 23

Matrix Multiplication

(multiply() code as above.)

What's wrong with this picture?

EPFL CS-206 – Spring 2015 Lec.3 - 24

Thread Overhead

- One thread per task
  - One dot-product task each

- Threads require resources
  - State:
    - Memory for stacks
    - A copy of the register file
    - Program state: program counter, stack pointer, …
  - Setup, teardown
  - Scheduler overhead

- Short-lived threads
  - Bad ratio of work versus overhead (a thread-pool sketch follows below)
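The slides' fix is to give each thread more work; another standard remedy, not covered in this lecture, is a thread pool that reuses a few long-lived threads. A minimal sketch with java.util.concurrent (the work() method is a stand-in for one task):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolDemo {
  public static void main(String[] args) throws InterruptedException {
    // Four long-lived threads instead of one short-lived thread per task,
    // so thread setup/teardown cost is paid 4 times rather than 256 times.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int t = 0; t < 256; t++) {
      final int task = t;
      pool.submit(() -> work(task));
    }
    pool.shutdown();                            // stop accepting tasks
    pool.awaitTermination(1, TimeUnit.MINUTES); // wait for queued tasks
  }

  static void work(int task) { /* e.g., one dot product */ }
}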

EPFL CS-206 – Spring 2015 Lec.3 - 25

One More "Big" Performance-Related Axiom

- Amdahl's Law
  - In English: if you speed up only a small fraction of the execution time of a program or computation, the speedup you achieve on the whole application is limited

- Example: a program takes 100 s, split into a 10 s part and a 90 s part. A 10× speedup on the 10 s part leaves 1 s + 90 s = 91 s total.

EPFL CS-206 – Spring 2015 Lec.3 - 26

Amdahl’s Law

$$\text{Speedup} = \frac{1}{\dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} + \left(1 - \text{Fraction}_{\text{enhanced}}\right)}$$

$\text{Fraction}_{\text{enhanced}}$ is the parallel fraction.

EPFL CS-206 – Spring 2015 Lec.3 - 27

Amdahl’s Law

(Formula as above.)

$1 - \text{Fraction}_{\text{enhanced}}$ is the sequential fraction.

EPFL CS-206 – Spring 2015 Lec.3 - 28

Amdahl’s Law

Simple example:
- A program runs for 100 seconds on a uniprocessor
- 10% of the program can be parallelized on a multiprocessor
- Assume an ideal multiprocessor with 10 processors

$$\text{Speedup} = \frac{1}{\frac{0.1}{10} + (1 - 0.1)} = \frac{1}{0.01 + 0.9} = \frac{1}{0.91} \approx 1.1$$
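The same arithmetic as a small helper, handy for checking examples like the one above (not from the slides; the naming is mine):

// Amdahl's law as written on the slide:
// speedup = 1 / (f/s + (1 - f)), where f is the enhanced (parallel)
// fraction and s is the speedup achieved on that fraction.
static double amdahl(double fractionEnhanced, double speedupEnhanced) {
  return 1.0 / (fractionEnhanced / speedupEnhanced + (1.0 - fractionEnhanced));
}

// amdahl(0.1, 10) returns ~1.099, the ~1.1x of the example;
// amdahl(0.9, 10) returns ~5.26: even 90% parallel code falls short of 10x.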

EPFL CS-206 – Spring 2015 Lec.3 - 29

Implications of Amdahl's Law

[Figure: normalized execution time (0 to 1) versus number of processors (1, 2, 4, …, 128), one curve per Fraction_enhanced of 10%, 50%, 70%, 80%, 90%; the curves flatten once the sequential fraction dominates.]

EPFL CS-206 – Spring 2015 Lec.3 - 30

Parallel execution is not ideal

- 10 processors rarely get a speedup of 10
  - Load imbalance
  - Thread start/join overhead
  - Communication overhead

- Even if Fraction_enhanced is close to 100%
  - Speedup_enhanced ≪ p for p processors
  - Our goal is to get it as close as possible to p (see the limit below)
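Making the second bullet explicit: even with unboundedly many processors, the sequential fraction caps the speedup:

$$\lim_{p \to \infty} \text{Speedup} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$

So a program that is 90% parallel can never exceed a 10× speedup, however many cores it gets.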

EPFL CS-206 – Spring 2015 Lec.3 - 31

Amdahl’s Law (in practice)

EPFL CS-206 – Spring 2015 Lec.3 - 32

Back to Matrix Multiplication

(multiply() code as above.)

The driver itself runs sequentially: the (1 - Fraction_enhanced) part.

EPFL CS-206 – Spring 2015 Lec.3 - 33

Matrix Multiplication

(Worker code as above.)

The workers run in parallel: the Fraction_enhanced part.

EPFL CS-206 – Spring 2015 Lec.3 - 34

Example: n = 16

- How many threads will our Matrix Multiplication create? (With n = 16: n × n = 256 worker threads.)

- How many of these threads are concurrent (i.e., degree of concurrency)? (All 256; every dot product is independent.)

EPFL CS-206 – Spring 2015 Lec.3 - 35

Example: Assume there are 4 processors

- How many threads per processor? (256 threads / 4 processors = 64.)

- How many threads run in parallel (i.e., degree of parallelism)? (At most 4, one per processor.)

EPFL CS-206 – Spring 2015 Lec.3 - 36

Best efficiency: Concurrency ~ Parallelism

- All work is independent
- Max parallelism? 4
- Only need 4 threads
  - Reduce thread.start(), thread.join() overhead
  - Only 4 start() and 4 join()
  - Workers (each thread) should do more work
  - 16×16 dot products divided by 4 = 64 dot products per thread

[Figure: the 16 rows split among the 4 threads: rows 0-3, 4-7, 8-11, 12-15.]

EPFL CS-206 – Spring 2015 Lec.3 - 37

Matrix Multiplication on “p” cores

void multiply() {
  BigWorker[] worker = new BigWorker[n]; // only every (n/p)-th slot is used
  for (int row = 0; row < n; row += n/p)
    worker[row] = new BigWorker(row);
  for (int row = 0; row < n; row += n/p)
    worker[row].start();
  for (int row = 0; row < n; row += n/p)
    worker[row].join(); // join() declares InterruptedException; handling elided
}

EPFL CS-206 – Spring 2015 Lec.3 - 38

Matrix Multiplication: (n/p) × n per worker

(multiply() code as above.)

Annotation: the first loop creates p threads (one BigWorker per block of n/p rows)

EPFL CS-206 – Spring 2015 Lec.3 - 39

BigWorker: Each thread does (n/p) × n

class BigWorker extends Thread {
  int begin_row;
  BigWorker(int begin_row) {
    this.begin_row = begin_row;
  }
  public void run() {
    for (int row = begin_row; row < begin_row + n/p; row++) {
      for (int col = 0; col < n; col++) {
        double dotProduct = 0.0; // reset for each C[row][col]
        for (int i = 0; i < n; i++)
          dotProduct += A[row][i] * B[i][col];
        C[row][col] = dotProduct;
      }
    }
  }
}

EPFL CS-206 – Spring 2015 Lec.3 - 40

BigWorker: Each thread does (n/p) × n

(BigWorker code as above.)

Annotation: the outer loop covers multiple rows, the inner loop all columns

EPFL CS-206 – Spring 2015 Lec.3 - 41

What if n is not divisible by p?

- Let's pick n = 20 and p = 16
  - Matrices of 20×20 running on 16 cores (one even split is sketched below)
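The slides leave the answer open; one standard scheme is to give the first n mod p threads one extra row, so no thread has more than one row above the rest. A sketch under that assumption, with a hypothetical RangeWorker(beginRow, endRow) covering rows [beginRow, endRow):

// n = 20 rows over p = 16 threads: 20 % 16 = 4 threads get
// ceil(20/16) = 2 rows each; the other 12 get floor(20/16) = 1 row.
int base = n / p;    // minimum rows per thread
int extra = n % p;   // this many threads take one extra row
int begin = 0;
for (int t = 0; t < p; t++) {
  int rows = base + (t < extra ? 1 : 0);
  worker[t] = new RangeWorker(begin, begin + rows); // RangeWorker is illustrative
  begin += rows;
}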

EPFL CS-206 – Spring 2015 Lec.3 - 42

Parallel Primality Testing

- Challenge
  - Print the primes from 1 to $10^{10}$

- Given
  - Ten-processor multiprocessor
  - One thread per processor

- Goal
  - Get ten-fold speedup (or close)

EPFL CS-206 – Spring 2015 Lec.3 - 43

Load Balancing

- Split the work evenly
- Each thread tests a range of $10^9$ numbers

[Figure: the range 1 … $10^{10}$ split into ten blocks of $10^9$; P0 takes the first block, P1 the second, …, P9 the last.]

EPFL CS-206 – Spring 2015 Lec.3 - 44

Procedure for Thread i

void primePrint() {
  int i = ThreadID.get(); // IDs in {0..9}
  // each thread tests its own block of 10^9 numbers
  for (long j = i*1000000000L + 1; j <= (i+1)*1000000000L; j++) {
    if (isPrime(j)) print(j);
  }
}

EPFL CS-206 – Spring 2015 Lec.3 - 45

Issues

- Higher ranges have fewer primes
- Yet larger numbers are harder to test (see the isPrime sketch below)
- Thread workloads
  - Uneven
  - Hard to predict
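The slides never show isPrime; a simple trial-division sketch makes the imbalance concrete, since the work per number grows roughly with √j while primes merely thin out:

// Trial division: tries divisors up to sqrt(j), so testing larger numbers
// costs more, and the cost swings between quick rejections (an early
// divisor hits) and full sqrt(j) scans (j is prime).
static boolean isPrime(long j) {
  if (j < 2) return false;
  for (long d = 2; d * d <= j; d++)
    if (j % d == 0) return false;
  return true;
}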

EPFL CS-206 – Spring 2015 Lec.3 - 46

Issues

- Higher ranges have fewer primes
- Yet larger numbers are harder to test
- Thread workloads
  - Uneven
  - Hard to predict

- Need dynamic load balancing

EPFL CS-206 – Spring 2015 Lec.3 - 47

Shared Counter

[Figure: a shared counter handing out 17, 18, 19, …; each thread takes a number.]

EPFL CS-206 – Spring 2015 Lec.3 - 48

Procedure for Thread i

Counter counter = new Counter(1);

void primePrint() {
  long j = 0;
  while (j < 10000000000L) { // 10^10
    j = counter.getAndIncrement();
    if (isPrime(j)) print(j);
  }
}

EPFL CS-206 – Spring 2015 Lec.3 - 49

Procedure for Thread i

(Code as above.)

Annotation: counter is a shared Counter object

EPFL CS-206 – Spring 2015 Lec.3 - 50

Where Things Reside

[Figure: a bus-based multiprocessor. Each processor has its own cache; the caches connect over a bus to shared memory, which holds the shared counter. Each thread has its own copy of the code (primePrint, as above) and its own local variables (i, j).]

EPFL CS-206 – Spring 2015 Lec.3 - 51

Procedure for Thread i

(Code as above.)

Annotation: the while (j < 10^10) test stops the loop once every value has been taken

EPFL CS-206 – Spring 2015 Lec.3 - 52

Procedure for Thread i

(Code as above.)

Annotation: getAndIncrement() increments the counter and returns each new value

EPFL CS-206 – Spring 2015 Lec.3 - 53

Counter Implementation

public class Counter {
  private long value;
  public long getAndIncrement() {
    return value++;
  }
}

EPFL CS-206 – Spring 2015 Lec.3 - 54

Counter Implementation

(Counter code as above.)

OK for a single thread, not for concurrent threads. (A demo of the failure follows below.)
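Not on the slides: a small demo that makes the problem visible. Two threads each call getAndIncrement() a million times on the Counter above (assuming its implicit no-arg constructor, so value starts at 0); lost updates routinely leave the final count short of 2,000,000:

public class RaceDemo {
  public static void main(String[] args) throws InterruptedException {
    Counter counter = new Counter();
    Runnable work = () -> {
      for (int i = 0; i < 1_000_000; i++) counter.getAndIncrement();
    };
    Thread a = new Thread(work);
    Thread b = new Thread(work);
    a.start(); b.start();
    a.join(); b.join();
    // One extra call: its return value is the count both threads reached.
    System.out.println("count = " + counter.getAndIncrement()
        + " (2000000 expected)");
  }
}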

EPFL CS-206 – Spring 2015 Lec.3 - 55

What It Means

(Counter code as above.)

EPFL CS-206 – Spring 2015 Lec.3 - 56

What It Means

(Counter code as above.) The expression value++ is really three steps:

temp = value;
value = temp + 1;
return temp;

EPFL CS-206 – Spring 2015 Lec.3 - 57

Not so good…

[Figure: an interleaving on the shared counter, with the value initially 1. One thread reads 1; a second thread also reads 1; a write of 2 follows, then a read of 2 and a write of 3; finally the delayed write of 2 lands. The value takes on 2, 3, 2, so an increment is lost.]

EPFL CS-206 – Spring 2015 Lec.3 - 58

Is this problem inherent?

If we could only glue reads and writes together…

[Figure: each thread's read and its following write glued into one indivisible pair.]

EPFL CS-206 – Spring 2015 Lec.3 - 59

Challenge

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp = value;
    value = temp + 1;
    return temp;
  }
}

EPFL CS-206 – Spring 2015 Lec.3 - 60

Challenge

(Counter code as above.)

Make these steps atomic (indivisible).

EPFL CS-206 – Spring 2015 Lec.3 - 61

Hardware Solution

(Counter code as above.)

A hardware read-modify-write (RMW) instruction can perform the read and the update as one indivisible step.
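Java exposes exactly this hardware support through java.util.concurrent.atomic: AtomicLong.getAndIncrement() performs the read-modify-write as one atomic operation. A drop-in version of the counter:

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter {
  private final AtomicLong value = new AtomicLong(1); // start at 1, as on slide 49

  public long getAndIncrement() {
    return value.getAndIncrement(); // one indivisible read-modify-write
  }
}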

EPFL CS-206 – Spring 2015 Lec.3 - 62

An Aside: Java™

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp;
    synchronized (this) { // Java requires a lock object; the slide's bare block is shorthand
      temp = value;
      value = temp + 1;
    }
    return temp;
  }
}
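Equivalently, and more idiomatically, the whole method can be declared synchronized, which locks on this for the duration of the call:

public class Counter {
  private long value;

  // A synchronized method locks on `this`, so only one thread at a time
  // can be inside getAndIncrement().
  public synchronized long getAndIncrement() {
    return value++;
  }
}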

EPFL CS-206 – Spring 2015 Lec.3 - 63

An Aside: Java™

(Counter code as above.)

Annotation: the synchronized block

EPFL CS-206 – Spring 2015 Lec.3 - 64

An Aside: Java™

(Counter code as above.)

Annotation: the synchronized block provides mutual exclusion

EPFL CS-206 – Spring 2015 Lec.3 - 65

Summary

- Need to think parallel
  - Division of work
  - Lots of bottlenecks

- Don't forget Amdahl's Law
- Next week: Mutual Exclusion